Taewoo Kim created ASTERIXDB-2145:
-------------------------------------

             Summary: Recovery process fails on 100 datasets
                 Key: ASTERIXDB-2145
                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
             Project: Apache AsterixDB
          Issue Type: Bug
            Reporter: Taewoo Kim


The Cloudberry DB instance currently has 112 datasets in a single dataverse. 
When that instance was restarted, the NC reported the following error and 
stopped. 

java.lang.IllegalStateException: Failed to redo
at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
Caused by: org.apache.hyracks.api.exceptions.HyracksDataException: Cannot allocate dataset 191 memory since memory budget would be exceeded.
at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
... 8 more

I then increased the storage.memorycomponent.globalbudget parameter from 3GB 
to 5GB. The recovery process still could not finish, and the NC exited with 
the following log output. 

... similar log records ...
Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
INFO: Loading dataverse:berry
Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
INFO: Loading index:meta_idx_meta
Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
INFO: JVM exiting with status 2; bye!

I then checked the parameter information page and found that the default 
value of storage.memorycomponent.numpages is 1/16 of the global memory 
component budget. I therefore decreased this parameter so that more datasets 
could fit in memory, and the instance was finally able to start. It seems 
that the recovery process tries to load and keep all datasets in memory at 
once, and this needs to be checked.
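
The arithmetic behind this can be sketched as follows. This is only a rough 
illustration, not AsterixDB code: it assumes that every open dataset reserves 
one full memory component budget, and it ignores metadata datasets, secondary 
indexes and page-size rounding. The function names are hypothetical. Under 
those assumptions, the default 1/16 ratio allows about 16 datasets in memory 
regardless of the global budget, which may be why going from 3GB to 5GB did 
not help.

```python
GIB = 1024 ** 3


def default_per_dataset_budget(global_budget_bytes):
    # Default stated on the parameter page: 1/16 of the global budget.
    return global_budget_bytes // 16


def max_resident_datasets(global_budget_bytes, per_dataset_bytes):
    # Assumption: each open dataset reserves exactly per_dataset_bytes.
    return global_budget_bytes // per_dataset_bytes


def per_dataset_budget_for(global_budget_bytes, num_datasets):
    # Largest per-dataset budget that still lets num_datasets fit at once.
    return global_budget_bytes // num_datasets


if __name__ == "__main__":
    for budget_gib in (3, 5):
        budget = budget_gib * GIB
        per_dataset = default_per_dataset_budget(budget)
        fit = max_resident_datasets(budget, per_dataset)
        print(f"{budget_gib}GB global budget, default ratio: {fit} datasets fit")

    # Upper bound on the per-dataset budget for 112 datasets in 5GB.
    needed = per_dataset_budget_for(5 * GIB, 112)
    print(f"112 datasets in 5GB: at most {needed / 1024 ** 2:.1f} MiB each")
```

Since the ratio is fixed, raising the global budget alone leaves the dataset 
count unchanged in this model; only lowering the per-dataset budget lets more 
datasets fit.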



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)