[ 
https://issues.apache.org/jira/browse/KYLIN-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115922#comment-16115922
 ] 

Alexander Sterligov edited comment on KYLIN-2749 at 8/6/17 8:29 PM:
--------------------------------------------------------------------

Sorry for long response.

*The problem seams to be starting from*
2017-07-22 09:22:34,544 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-261] 
client.HBaseAdmin:965 : Deleted KYLIN_7MMHCHKVVB
2017-07-22 09:22:34,545 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-261] 
steps.MergeGCStep:86 : Dropped htable: KYLIN_7MMHCHKVVB

*Then the job is successfully finished:*
2017-07-22 09:25:41,056 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-261] 
execution.ExecutableManager:396 : job id:1655adfa-112e-4ed0-b44e-75a021ec6fcb 
from RUNNING to SUCCEED

*You can see no errors in between*

*After 30 minutes Kylin was shut down:*
2017-07-22 09:53:47,666 INFO  [Thread-11] threadpool.DefaultScheduler:228 : 
Shutingdown Job Engine ....
2017-07-22 09:55:06,622 WARN  [localhost-startStop-1] common.KylinConfig:232 : 
KYLIN_CONF property was not set, will seek KYLIN_HOME env variable

*Then merge job is started again:*
2017-07-22 09:55:32,257 INFO  [pool-7-thread-1] threadpool.DefaultScheduler:114 
: CubingJob{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb, name=cohort_main - 
20170715000000_20170722000000 - MERGE - GMT+03:00 2017-07-22 12:11:32, 
state=READY} prepare to schedule
2017-07-22 09:55:32,259 INFO  [pool-7-thread-1] threadpool.DefaultScheduler:117 
: CubingJob{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb, name=cohort_main - 
20170715000000_20170722000000 - MERGE - GMT+03:00 2017-07-22 12:11:32, 
state=READY} scheduled

*Then it crashes:*
2017-07-22 09:55:32,546 ERROR [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.AbstractExecutable:126 : error running Executable: 
MergeDictionaryStep{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb-00, name=Merge 
Cuboid Dictionary, state=RUNNING}
2017-07-22 09:55:32,547 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb-00
2017-07-22 09:55:32,551 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb-00
2017-07-22 09:55:32,554 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.ExecutableManager:396 : job 
id:1655adfa-112e-4ed0-b44e-75a021ec6fcb-00 from RUNNING to ERROR
2017-07-22 09:55:32,555 ERROR [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.AbstractExecutable:126 : error running Executable: 
CubingJob{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb, name=cohort_main - 
20170715000000_20170722000000 - MERGE - GMT+03:00 2017-07-22 12:11:32, 
state=RUNNING}
2017-07-22 09:55:32,556 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb
2017-07-22 09:55:32,559 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb
2017-07-22 09:55:32,561 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.ExecutableManager:396 : job id:1655adfa-112e-4ed0-b44e-75a021ec6fcb 
from RUNNING to ERROR
2017-07-22 09:55:32,562 WARN  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.AbstractExecutable:258 : no need to send email, user list is empty
2017-07-22 09:55:32,562 ERROR [pool-8-thread-1] threadpool.DefaultScheduler:146 
: ExecuteException job:1655adfa-112e-4ed0-b44e-75a021ec6fcb
org.apache.kylin.job.exception.ExecuteException: 
org.apache.kylin.job.exception.ExecuteException: java.lang.NullPointerException
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:134)
        at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:142)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kylin.job.exception.ExecuteException: 
java.lang.NullPointerException
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:134)
        at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:64)
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
        ... 4 more
Caused by: java.lang.NullPointerException
        at 
org.apache.kylin.engine.mr.steps.MergeDictionaryStep.mergeDictionaries(MergeDictionaryStep.java:148)
        at 
org.apache.kylin.engine.mr.steps.MergeDictionaryStep.makeDictForNewSegment(MergeDictionaryStep.java:136)
        at 
org.apache.kylin.engine.mr.steps.MergeDictionaryStep.doWork(MergeDictionaryStep.java:68)
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
        ... 6 more

*Query stopped working*
2017-07-22 09:55:50,039 ERROR [pool-13-thread-17] v2.CubeHBaseEndpointRPC:235 : 
<sub-thread for Query 1ff0041f-d5c3-4848-b31a-d8d1727b4463 GTScanRequest 
37ccdf2a>Error when visiting cubes by endpoint
org.apache.hadoop.hbase.TableNotFoundException: Table 'KYLIN_7MMHCHKVVB' was 
not found, got: KYLIN_7H3WSPX1UJ.

So it looks like after restart Kylin got metadata like it was not written 
before. Metadata both for job and for cube was not written, but hbase tables 
for segments has been changed. It looks like there are some flush issues in 
metastore.


was (Author: sterligovak):
Sorry for long response.

*The problem seams to be starting from*
2017-07-22 09:22:34,544 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-261] 
client.HBaseAdmin:965 : Deleted KYLIN_7MMHCHKVVB
2017-07-22 09:22:34,545 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-261] 
steps.MergeGCStep:86 : Dropped htable: KYLIN_7MMHCHKVVB

*Then it is successfully finished:*
2017-07-22 09:25:41,056 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-261] 
execution.ExecutableManager:396 : job id:1655adfa-112e-4ed0-b44e-75a021ec6fcb 
from RUNNING to SUCCEED

*You can see no errors in between*

*After 30 minutes Kylin was shut down:*
2017-07-22 09:53:47,666 INFO  [Thread-11] threadpool.DefaultScheduler:228 : 
Shutingdown Job Engine ....
2017-07-22 09:55:06,622 WARN  [localhost-startStop-1] common.KylinConfig:232 : 
KYLIN_CONF property was not set, will seek KYLIN_HOME env variable

*Then merge job is started again:*
2017-07-22 09:55:32,257 INFO  [pool-7-thread-1] threadpool.DefaultScheduler:114 
: CubingJob{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb, name=cohort_main - 
20170715000000_20170722000000 - MERGE - GMT+03:00 2017-07-22 12:11:32, 
state=READY} prepare to schedule
2017-07-22 09:55:32,259 INFO  [pool-7-thread-1] threadpool.DefaultScheduler:117 
: CubingJob{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb, name=cohort_main - 
20170715000000_20170722000000 - MERGE - GMT+03:00 2017-07-22 12:11:32, 
state=READY} scheduled

*Then it crashes:*
2017-07-22 09:55:32,546 ERROR [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.AbstractExecutable:126 : error running Executable: 
MergeDictionaryStep{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb-00, name=Merge 
Cuboid Dictionary, state=RUNNING}
2017-07-22 09:55:32,547 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb-00
2017-07-22 09:55:32,551 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb-00
2017-07-22 09:55:32,554 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.ExecutableManager:396 : job 
id:1655adfa-112e-4ed0-b44e-75a021ec6fcb-00 from RUNNING to ERROR
2017-07-22 09:55:32,555 ERROR [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.AbstractExecutable:126 : error running Executable: 
CubingJob{id=1655adfa-112e-4ed0-b44e-75a021ec6fcb, name=cohort_main - 
20170715000000_20170722000000 - MERGE - GMT+03:00 2017-07-22 12:11:32, 
state=RUNNING}
2017-07-22 09:55:32,556 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb
2017-07-22 09:55:32,559 DEBUG [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
dao.ExecutableDao:217 : updating job output, id: 
1655adfa-112e-4ed0-b44e-75a021ec6fcb
2017-07-22 09:55:32,561 INFO  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.ExecutableManager:396 : job id:1655adfa-112e-4ed0-b44e-75a021ec6fcb 
from RUNNING to ERROR
2017-07-22 09:55:32,562 WARN  [Job 1655adfa-112e-4ed0-b44e-75a021ec6fcb-197] 
execution.AbstractExecutable:258 : no need to send email, user list is empty
2017-07-22 09:55:32,562 ERROR [pool-8-thread-1] threadpool.DefaultScheduler:146 
: ExecuteException job:1655adfa-112e-4ed0-b44e-75a021ec6fcb
org.apache.kylin.job.exception.ExecuteException: 
org.apache.kylin.job.exception.ExecuteException: java.lang.NullPointerException
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:134)
        at 
org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:142)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.kylin.job.exception.ExecuteException: 
java.lang.NullPointerException
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:134)
        at 
org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:64)
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
        ... 4 more
Caused by: java.lang.NullPointerException
        at 
org.apache.kylin.engine.mr.steps.MergeDictionaryStep.mergeDictionaries(MergeDictionaryStep.java:148)
        at 
org.apache.kylin.engine.mr.steps.MergeDictionaryStep.makeDictForNewSegment(MergeDictionaryStep.java:136)
        at 
org.apache.kylin.engine.mr.steps.MergeDictionaryStep.doWork(MergeDictionaryStep.java:68)
        at 
org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
        ... 6 more

*Query stopped working*
2017-07-22 09:55:50,039 ERROR [pool-13-thread-17] v2.CubeHBaseEndpointRPC:235 : 
<sub-thread for Query 1ff0041f-d5c3-4848-b31a-d8d1727b4463 GTScanRequest 
37ccdf2a>Error when visiting cubes by endpoint
org.apache.hadoop.hbase.TableNotFoundException: Table 'KYLIN_7MMHCHKVVB' was 
not found, got: KYLIN_7H3WSPX1UJ.

So it looks like after restart Kylin got metadata like it was not written 
before. Metadata both for job and for cube was not written, but hbase tables 
for segments has been changed. It looks like there are some flush issues in 
metastore.

> Merge may remove old segments without saving merged segment
> -----------------------------------------------------------
>
>                 Key: KYLIN-2749
>                 URL: https://issues.apache.org/jira/browse/KYLIN-2749
>             Project: Kylin
>          Issue Type: Bug
>    Affects Versions: v2.0.0
>            Reporter: Alexander Sterligov
>         Attachments: kylin.log.2017-07-22.tar
>
>
> Merge started to work on last 7 segments.
> During the process hbase had a failure because of spot-instances shutdown in 
> Amazon. Data was not lost, because it is at S3.
> I stopped kylin and did hbase hbck --repair. During the report of repair I 
> didn't see any information about lost data, just redistribution of regions.
> Then after kylin was started I cannot query data from the last  7 segments:
> {quote}
> Caused by: java.lang.RuntimeException: 
> org.apache.hadoop.hbase.TableNotFoundException: Table 'KYLIN_7MMHCHKVVB' was 
> not found, got: KYLIN_7H3WSPX1UJ.
>         at com.google.common.base.Throwables.propagate(Throwables.java:160)
>         at 
> org.apache.kylin.storage.hbase.cube.v2.ExpectedSizeIterator.next(ExpectedSizeIterator.java:67)
>         at 
> org.apache.kylin.storage.hbase.cube.v2.ExpectedSizeIterator.next(ExpectedSizeIterator.java:31)
>         at 
> com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
>         at com.google.common.collect.Iterators$6.hasNext(Iterators.java:583)
>         at 
> org.apache.kylin.storage.gtrecord.SegmentCubeTupleIterator$2.hasNext(SegmentCubeTupleIterator.java:116)
>         at 
> org.apache.kylin.storage.gtrecord.SegmentCubeTupleIterator.hasNext(SegmentCubeTupleIterator.java:149)
>         at com.google.common.collect.Iterators$6.hasNext(Iterators.java:582)
>         at 
> org.apache.kylin.storage.gtrecord.SequentialCubeTupleIterator.hasNext(SequentialCubeTupleIterator.java:129)
>         at 
> org.apache.kylin.query.enumerator.OLAPEnumerator.moveNext(OLAPEnumerator.java:67)
>         at Baz$1$1.moveNext(Unknown Source)
>         at 
> org.apache.calcite.linq4j.EnumerableDefaults.groupBy_(EnumerableDefaults.java:826)
>         at 
> org.apache.calcite.linq4j.EnumerableDefaults.groupBy(EnumerableDefaults.java:761)
>         at 
> org.apache.calcite.linq4j.DefaultEnumerable.groupBy(DefaultEnumerable.java:302)
>         at Baz.bind(Unknown Source)
>         at 
> org.apache.calcite.jdbc.CalcitePrepare$CalciteSignature.enumerable(CalcitePrepare.java:331)
>         at 
> org.apache.calcite.jdbc.CalciteConnectionImpl.enumerable(CalciteConnectionImpl.java:294)
>         at 
> org.apache.calcite.jdbc.CalciteMetaImpl._createIterable(CalciteMetaImpl.java:553)
>         at 
> org.apache.calcite.jdbc.CalciteMetaImpl.createIterable(CalciteMetaImpl.java:544)
>         at 
> org.apache.calcite.avatica.AvaticaResultSet.execute(AvaticaResultSet.java:193)
>         at 
> org.apache.calcite.jdbc.CalciteResultSet.execute(CalciteResultSet.java:67)
>         at 
> org.apache.calcite.jdbc.CalciteResultSet.execute(CalciteResultSet.java:44)
>         at 
> org.apache.calcite.avatica.AvaticaConnection$1.execute(AvaticaConnection.java:607)
>         at 
> org.apache.calcite.jdbc.CalciteMetaImpl.prepareAndExecute(CalciteMetaImpl.java:600)
>         at 
> org.apache.calcite.avatica.AvaticaConnection.prepareAndExecuteInternal(AvaticaConnection.java:615)
>         at 
> org.apache.calcite.avatica.AvaticaStatement.executeInternal(AvaticaStatement.java:148)
>         ... 77 more
> Caused by: org.apache.hadoop.hbase.TableNotFoundException: Table 
> 'KYLIN_7MMHCHKVVB' was not found, got: KYLIN_7H3WSPX1UJ.
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1310)
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1189)
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1173)
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1130)
>         at 
> org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.getRegionLocation(ConnectionManager.java:965)
>         at 
> org.apache.hadoop.hbase.client.HRegionLocator.getRegionLocation(HRegionLocator.java:83)
>         at 
> org.apache.hadoop.hbase.client.HTable.getRegionLocation(HTable.java:505)
>         at 
> org.apache.hadoop.hbase.client.HTable.getKeysAndRegionsInRange(HTable.java:721)
>         at 
> org.apache.hadoop.hbase.client.HTable.getKeysAndRegionsInRange(HTable.java:691)
>         at 
> org.apache.hadoop.hbase.client.HTable.getStartKeysInRange(HTable.java:1796)
>         at 
> org.apache.hadoop.hbase.client.HTable.coprocessorService(HTable.java:1751)
>         at 
> org.apache.kylin.storage.hbase.cube.v2.CubeHBaseEndpointRPC$1.run(CubeHBaseEndpointRPC.java:182)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         ... 1 more
> {quote}
> hbase really doesn't contain tables for the last 7 segments and I didn't call 
> any cleanup jobs.
> It looks like Merge removed old segment tables before merged segment was 
> saved.
> I'm going to continue to investigate this problem and will post more details 
> next week.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to