[
https://issues.apache.org/jira/browse/ASTERIXDB-1114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14935764#comment-14935764
]
Murtadha Hubail commented on ASTERIXDB-1114:
--------------------------------------------
The issue occurred when a secondary index was registered before the primary
index (dataset). This happens when creating a secondary index on a closed
dataset. In that case, the newly created secondary index instance is passed a
null reference to the DatasetInfo, since the dataset has not been registered
yet. When that index is flushed during a checkpoint on NC shutdown, the NPE
happens. This wouldn't happen for a primary index because we were resetting
its DatasetInfo reference during index registration.
The implemented fix guarantees that the DatasetInfo passed to any index is
never null.
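
For illustration only, the failure mode and the general shape of such a fix can
be sketched as below. This is not the actual AsterixDB code: the class names
(Registry, IndexTracker), the method names, and the create-on-demand approach
are all hypothetical, chosen only to show how a null DatasetInfo could reach an
index and how a registry might rule that out.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; none of these types mirror the real AsterixDB classes.
class DatasetInfoSketch {

    // Stand-in for the per-dataset bookkeeping object.
    static final class DatasetInfo {
        final int datasetId;
        int activeOperations;

        DatasetInfo(int datasetId) {
            this.datasetId = datasetId;
        }
    }

    // Stand-in for the tracker an index consults before an operation.
    static final class IndexTracker {
        private final DatasetInfo info;

        IndexTracker(DatasetInfo info) {
            this.info = info;
        }

        void beforeFlush() {
            // Throws NullPointerException if the tracker was given a null DatasetInfo.
            info.activeOperations++;
        }
    }

    static final class Registry {
        private final Map<Integer, DatasetInfo> datasets = new HashMap<>();

        // Problematic shape: a plain lookup yields null when the dataset
        // (primary index) has not been registered yet.
        IndexTracker registerIndexUnsafe(int datasetId) {
            return new IndexTracker(datasets.get(datasetId));
        }

        // One possible fix (an assumption): create the DatasetInfo on demand,
        // so no index ever receives a null reference.
        IndexTracker registerIndexSafe(int datasetId) {
            DatasetInfo info = datasets.computeIfAbsent(datasetId, DatasetInfo::new);
            return new IndexTracker(info);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        try {
            registry.registerIndexUnsafe(1).beforeFlush();
        } catch (NullPointerException e) {
            System.out.println("unsafe registration: NPE on flush");
        }
        registry.registerIndexSafe(1).beforeFlush();
        System.out.println("safe registration: flush bookkeeping succeeded");
    }
}
```

In this sketch the secondary index is registered first in both cases; only the
second variant survives the flush, because the registry never hands out a null
DatasetInfo.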
Current recovery test cases always access the primary index first, so the issue
is not triggered. Since the issue happens on NC shutdown, there is no good way
to catch it. I have written an ugly test case that reproduces the issue and
verified that it fails before the fix was merged and passes afterwards. The
test case verifies that the NC was shut down successfully by checking that no
NCDriver processes are running. However, after thinking about it, this
verification might produce many false positives, since an NCDriver process
might still be running due to a failure in a previous test case (managix stop
doesn't necessarily succeed), so I would rather not merge it.
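
A process check of this kind might look roughly like the sketch below. It is
not the unmerged test case: the class name ShutdownProcessCheck and its helper
methods are hypothetical, and it assumes Java 9+ (for ProcessHandle) and that
the NC's command line contains the string "NCDriver". It also shows why the
check is fragile: it cannot tell an NCDriver left over from an earlier failed
test apart from one that failed to stop in the current test.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of a "no NCDriver process is running" check.
class ShutdownProcessCheck {

    // Pure helper: true if any command line contains the marker string.
    static boolean anyMatches(List<String> commandLines, String marker) {
        return commandLines.stream().anyMatch(c -> c.contains(marker));
    }

    // Command lines of all processes visible to this JVM (Java 9+).
    // On some platforms the command line is unavailable and is reported as "".
    static List<String> visibleCommandLines() {
        return ProcessHandle.allProcesses()
                .map(ph -> ph.info().commandLine().orElse(""))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Matches any NCDriver on the machine, including stale ones from
        // previous test cases, which is the source of the false positives.
        boolean running = anyMatches(visibleCommandLines(), "NCDriver");
        System.out.println("NCDriver process visible: " + running);
    }
}
```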
Since the current fix eliminates the possibility of getting a null DatasetInfo,
I think this issue is fixed. When we have a more mature test framework, a test
case could be added to force the system to flush an index without having to
shut down the NC.
> managix stop [instance_name] throws NullPointerException during checkpoint
> --------------------------------------------------------------------------
>
> Key: ASTERIXDB-1114
> URL: https://issues.apache.org/jira/browse/ASTERIXDB-1114
> Project: Apache AsterixDB
> Issue Type: Bug
> Reporter: Young-Seok Kim
> Assignee: Murtadha Hubail
>
> When an AsterixDB instance is stopped after inserting a record, the checkpoint
> from the recovery manager throws a NullPointerException, which prevents
> NCDriver from stopping.
> The exception is shown below:
> INFO: Stopping NodeControllerService
> java.lang.NullPointerException
> at org.apache.asterix.common.context.BaseOperationTracker.beforeOperation(BaseOperationTracker.java:45)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.enterComponents(LSMHarness.java:178)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.getAndEnterComponents(LSMHarness.java:113)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.scheduleFlush(LSMHarness.java:376)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.scheduleFlush(LSMTreeIndexAccessor.java:122)
> at org.apache.asterix.common.context.DatasetLifecycleManager.flushAndWaitForIO(DatasetLifecycleManager.java:237)
> at org.apache.asterix.common.context.DatasetLifecycleManager.flushDatasetOpenIndexes(DatasetLifecycleManager.java:518)
> at org.apache.asterix.common.context.DatasetLifecycleManager.flushAllDatasets(DatasetLifecycleManager.java:439)
> at org.apache.asterix.transaction.management.service.recovery.RecoveryManager.checkpoint(RecoveryManager.java:379)
> at org.apache.asterix.hyracks.bootstrap.NCApplicationEntryPoint.stop(NCApplicationEntryPoint.java:138)
> at org.apache.hyracks.control.nc.NodeControllerService.stop(NodeControllerService.java:343)
> at org.apache.hyracks.control.nc.NCDriver$1.run(NCDriver.java:53)