[ 
https://issues.apache.org/jira/browse/KYLIN-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999869#comment-16999869
 ] 

Liang Geng commented on KYLIN-4307:
-----------------------------------

Dear all,

I have come up with two solution to resolve this issue and minimize the 
modifications. The first one is raise the WriteConflictException conditional, 
like this: 

 
{code:java}
protected long checkAndPutResourceImpl(String resPath, byte[] content, long 
oldTS, long newTS)
        throws IOException, IllegalStateException {
    try {
        .......
        boolean ok = table.checkAndPut(row, B_FAMILY, B_COLUMN_TS, bOldTS, put);
        logger.trace("Update row {} from oldTs: {}, to newTs: {}, operation 
result: {}", resPath, oldTS, newTS, ok);
        if (!ok) {
            long real = getResourceTimestampImpl(resPath);
            if(newTS != real)
                throw new WriteConflictException(
                        "Overwriting conflict " + resPath + ", expect old TS " 
+ oldTS + ", but it is " + real);
            else
                logger.warn("Potential Overwriting conflict " + resPath + ", 
expect old TS " + oldTS + ", but it is " + real);
        }
    } ......
}{code}
But this solution has a chance to prune the conflict when multiple writers are 
trying to write with same new timestamp.

Another one looks more reasonable: throw the exception conditional.

 
{code:java}
protected long checkAndPutResourceImpl(String resPath, byte[] content, long 
oldTS, long newTS)
        throws IOException, IllegalStateException {
    try {
        ......
        boolean ok = table.checkAndPut(row, B_FAMILY, B_COLUMN_TS, bOldTS, put);
        logger.trace("Update row {} from oldTs: {}, to newTs: {}, operation 
result: {}", resPath, oldTS, newTS, ok);
        if (!ok) {
            long real = getResourceTimestampImpl(resPath);
            throw new WriteConflictException(
                    "Overwriting conflict " + resPath + ", expect old TS " + 
oldTS + ", but it is " + real);
        }
    } catch (Exception ex) {
        if (pushdown != null)
            pushdown.rollback();
        // don't throw the exception if the write operation performed
        if (checkShouldThrow(resPath, newTS))
            throw ex;
        else
            logger.warn("Exception occurred but the checkAndPut seems 
successful", ex);
    } finally {
        if (pushdown != null)
            pushdown.close();
        IOUtils.closeQuietly(table);
    }
    return newTS;
}

private boolean checkShouldThrow(String resPath, long newTS) {
    boolean shouldThrow = true;
    try {
        long real = getResourceTimestampImpl(resPath);
        if (real == newTS) {
            shouldThrow = false;
        }
    } catch (Exception e) {
        logger.error("Failed to get last modified timestamp", e);
    }
    return shouldThrow;
}
{code}
 

This solution may also invalid if any of the two scenarios below happened.
 # if getResourceTimestampImpl raises any exception or
 # If some writer modified with the same timestamp as newTS before calling 
getResourceTimestampImpl

 

 

> ExponentialBackoffRetry implemented a incorrect retry policy
> ------------------------------------------------------------
>
>                 Key: KYLIN-4307
>                 URL: https://issues.apache.org/jira/browse/KYLIN-4307
>             Project: Kylin
>          Issue Type: Bug
>          Components: Storage - HBase
>    Affects Versions: v2.6.4
>         Environment: HBase 1.1.5
>            Reporter: Liang Geng
>            Priority: Major
>         Attachments: Screen Shot 2019-12-19 at 12.16.05.png
>
>
> Today we had encountered a build error. The step "Build N-Dimension Cuboid : 
> level 1" failed caused by WriteConflictException.
>  
>  
> {code:java}
> 2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: 
> true2019-12-19 03:19:59,049 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696795196, to newTs: 1576696799048, operation result: 
> true2019-12-19 03:20:02,775 INFO  [BadQueryDetector] 
> service.BadQueryDetector:147 : Detect bad query.2019-12-19 03:20:03,254 TRACE 
> [Scheduler 1528161485 Job d33e65d0-4214-3b96-9256-607a730cb53f-880] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from oldTs: 
> 1576696793142, to newTs: 1576696803253, operation result: true2019-12-19 
> 03:20:03,256 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696803253, to newTs: 1576696803255, operation result: 
> true2019-12-19 03:20:04,187 INFO  [FetcherRunner 2062551616-143] 
> threadpool.DefaultFetcherRunner:85 : Job Fetcher: 2 should running, 2 actual 
> running, 0 stopped, 0 ready, 1754 already succeed, 7 error, 10 discarded, 0 
> others2019-12-19 03:20:09,124 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696799048, to newTs: 1576696809122, operation result: 
> true2019-12-19 03:20:09,125 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696809122, to newTs: 1576696809124, operation result: 
> true2019-12-19 03:20:13,263 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696803255, to newTs: 1576696813262, operation result: 
> true2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
> seconds.2019-12-19 03:20:25,228 TRACE [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from 
> oldTs: 1576696809124, to newTs: 1576696819131, operation result: 
> false2019-12-19 03:20:25,284 TRACE [Scheduler 1528161485 Job 
> d33e65d0-4214-3b96-9256-607a730cb53f-880] hbase.HBaseResourceStore:335 : 
> Update row /execute_output/d33e65d0-4214-3b96-9256-607a730cb53f-10 from 
> oldTs: 1576696813262, to newTs: 1576696825245, operation result: 
> true2019-12-19 03:20:25,299 ERROR [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
> name=Build N-Dimension Cuboid : level 1, 
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
> expect old TS 1576696809124, but it is 1576696819131 at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>  
> At 03:20:24,133, a thread "Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879" is trying to write content to the 
> path "/execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08" but the 
> exception was raised for some reasons. We don't know what kind of exception 
> happened because the exception was eaten by doWithRetry.
> ExponentialBackoffRetry will retry if the following exceptions happened: 
> SocketTimeoutException, or ConnectException, or RetriesExhaustedException or 
> any other exceptions configured in 
> "kylin.resourcestore.connection-exceptions".
> By default the retry policy is disabled, but we have enabled it by setting 
> "kylin.resourcestore.reconnect-enabled".
>  
> {code:java}
> // A successful written, 1576696809122 -> 1576696809124// A successful 
> written, 1576696809122 -> 15766968091242019-12-19 03:20:09,125 TRACE 
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
> 1576696809122, to newTs: 1576696809124, operation result: true
> // A written with exception but actually persistence to HBase, 1576696809124 
> -> 1576696819131// ExponentialBackoffRetry.doWithRetry catched the exception, 
> and then back-off2019-12-19 03:20:24,133 INFO  [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> persistence.ExponentialBackoffRetry:87 : Will try to re-connect after 1 
> seconds.
> // Retry as back-off time reached. But this time, the invoking of checkAndPut 
> failed because the content is already written.2019-12-19 03:20:25,228 TRACE 
> [Scheduler 1528161485 Job 4759f195-9443-2604-5358-d5804d9aaefc-879] 
> hbase.HBaseResourceStore:335 : Update row 
> /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08 from oldTs: 
> 1576696809124, to newTs: 1576696819131, operation result: false
> // checkAndPutResourceImpl throws WriteConflictException2019-12-19 
> 03:20:25,299 ERROR [Scheduler 1528161485 Job 
> 4759f195-9443-2604-5358-d5804d9aaefc-879] common.MapReduceExecutable:198 : 
> error execute MapReduceExecutable{id=4759f195-9443-2604-5358-d5804d9aaefc-08, 
> name=Build N-Dimension Cuboid : level 1, 
> state=RUNNING}org.apache.kylin.common.persistence.WriteConflictException: 
> Overwriting conflict /execute_output/4759f195-9443-2604-5358-d5804d9aaefc-08, 
> expect old TS 1576696809124, but it is 1576696819131 at 
> org.apache.kylin.storage.hbase.HBaseResourceStore.checkAndPutResourceImpl(HBaseResourceStore.java:338)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.lambda$checkAndPutResourceWithRetry$3(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ExponentialBackoffRetry.doWithRetry(ExponentialBackoffRetry.java:52)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceWithRetry(ResourceStore.java:423)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResourceCheckpoint(ResourceStore.java:411)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:405)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:388)
>  at 
> org.apache.kylin.common.persistence.ResourceStore.checkAndPutResource(ResourceStore.java:368)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.writeJobOutputResource(ExecutableDao.java:252)
>  at 
> org.apache.kylin.job.dao.ExecutableDao.updateJobOutput(ExecutableDao.java:426)
>  at 
> org.apache.kylin.job.execution.ExecutableManager.addJobInfo(ExecutableManager.java:566)
>  at 
> org.apache.kylin.engine.mr.common.MapReduceExecutable.doWork(MapReduceExecutable.java:163)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:71)
>  at 
> org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:167)
>  at 
> org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:114)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> {code}
>  
> As shown in the log, the thread failed to call checkAndPutResourceImpl and 
> sleep 1 second to back-off at "2019-12-19 03:20:25,228".
> At 03:20:25,299, the WriteConflictException was raised. The message is 
> "expect old TS 1576696809124, but it is 1576696819131". Obviously, the thread 
> is trying to update with the same timestamp. That proves the first written is 
> successful but throws exceptions. And then, that exception was caught by 
> doWithRetry.
> I'm not a HBase expert, the picture above shows my rough guess:  !Screen Shot 
> 2019-12-19 at 12.16.05.png!
> Thus, we think ExponentialBackoffRetry implemented an incorrect retry policy. 
> Determining whether to retry only by the type of exception is not reliable. 
> At least, we should pull the last modified time to verify the written is 
> successful or not. Right?
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to