Taewoo Kim created ASTERIXDB-2145:
-------------------------------------
Summary: Recovery process fails on 100 datasets
Key: ASTERIXDB-2145
URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
Project: Apache AsterixDB
Issue Type: Bug
Reporter: Taewoo Kim
The Cloudberry DB instance currently has 112 datasets in a single dataverse.
When I restarted that instance, the NC showed the following error and stopped.
java.lang.IllegalStateException: Failed to redo
at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
Caused by: org.apache.hyracks.api.exceptions.HyracksDataException: Cannot allocate dataset 191 memory since memory budget would be exceeded.
at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
... 8 more
I then increased the storage.memorycomponent.globalbudget parameter from 3GB
to 5GB. The NC still failed, with the following log output, and the recovery
process could not finish.
... similar log records ...
Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
INFO: Loading dataverse:berry
Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
INFO: Loading index:meta_idx_meta
Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
INFO: JVM exiting with status 2; bye!
Next, I checked the parameter information page and found that the default
value of storage.memorycomponent.numpages is 1/16 of the global memory
component budget. I therefore decreased this parameter so that more datasets
could fit in memory, and the instance was finally able to start. It seems that
the recovery process tries to load all datasets and keep them in memory at the
same time; this needs to be checked.
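
To illustrate why raising the global budget alone may not have helped, here is
a rough back-of-the-envelope sketch. It assumes a simplified model that is not
taken from the AsterixDB code: every dataset touched during redo reserves one
fixed-size allocation, that allocation defaults to 1/16 of the global budget,
and nothing is evicted while recovery runs. The function names are
hypothetical, "GB" is treated as GiB, and metadata datasets are ignored.

```python
GIB = 1024 ** 3


def default_per_dataset_bytes(global_budget_bytes, fraction=16):
    # Assumption from the report: the per-dataset default is 1/16 of the budget.
    return global_budget_bytes // fraction


def max_resident_datasets(global_budget_bytes, per_dataset_bytes):
    # Number of datasets that fit if each one reserves per_dataset_bytes.
    return global_budget_bytes // per_dataset_bytes


def max_per_dataset_bytes(global_budget_bytes, dataset_count):
    # Largest per-dataset reservation that still lets dataset_count datasets fit.
    return global_budget_bytes // dataset_count


if __name__ == "__main__":
    for budget_gib in (3, 5):
        budget = budget_gib * GIB
        per_dataset = default_per_dataset_bytes(budget)
        fits = max_resident_datasets(budget, per_dataset)
        print(f"{budget_gib} GiB budget: {per_dataset} bytes per dataset, {fits} datasets fit")

    # Per-dataset size that would let all 112 datasets fit in a 5 GiB budget.
    print(max_per_dataset_bytes(5 * GIB, 112))
```

Under these assumptions the number of datasets that fit stays at 16 for both
the 3GB and the 5GB budget, because the per-dataset default grows with the
budget. Only shrinking the per-dataset size (or evicting datasets during
recovery) would let all 112 datasets through.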