It would be good to get thread dumps if this happens again. A quick sketch of one way to capture them is below, and a rough sketch of the manual log reset Murtadha described follows the quoted thread.
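The simplest route is to run jstack against the stuck NC's process id. If you would rather trigger the dump from inside the JVM (say, from a diagnostic hook), something like the little helper below works; it is purely a sketch and not anything that ships with AsterixDB, and the class name and default output file are placeholders. It uses only the standard ThreadMXBean API.

import java.io.PrintWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of AsterixDB: writes the stacks of all live
// threads in the current JVM to a file so we can see where an NC is spinning.
public class ThreadDumpUtil {

    public static void dumpTo(String path) throws Exception {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        // true, true => also report locked monitors and ownable synchronizers.
        ThreadInfo[] infos = bean.dumpAllThreads(true, true);
        try (PrintWriter w = new PrintWriter(path)) {
            for (ThreadInfo info : infos) {
                w.printf("\"%s\" id=%d state=%s%n",
                        info.getThreadName(), info.getThreadId(), info.getThreadState());
                for (StackTraceElement frame : info.getStackTrace()) {
                    w.println("    at " + frame);
                }
                w.println();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        dumpTo(args.length > 0 ? args[0] : "nc-thread-dump.txt");
    }
}

Either way, if the stacks showed the worker threads parked inside BufferCache page pinning, that would line up with ASTERIXDB-1438.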
On Mon, May 16, 2016 at 10:56 PM Jianfeng Jia <jianfeng....@gmail.com> wrote:

> I revisited the logs, and luckily they hadn't been cleared. Here is part of
> nc1's log:
>
> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
> INFO: Opening file: 14 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
> INFO: Opening file: 13 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache createFile
> INFO: Creating file: /nc1/iodevice1/storage/partition_0/hackathon/log_device_idx_log_device/2016-05-15-12-56-48-712_2016-05-15-12-23-31-225_f in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
> INFO: Opening file: 15 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
> ——————————————————————————————————————
> /// I shut down the cluster at this point and restarted the server right away.
> ——————————————————————————————————————
> May 15, 2016 1:43:12 PM org.apache.asterix.transaction.management.service.recovery.RecoveryManager startRecoveryRedoPhase
> INFO: Logs REDO phase completed. Redo logs count: 1197
> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness flush
> INFO: Started a flush operation for index: LSMBTree [/nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/] ...
> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.common.buffercache.BufferCache createFile
> INFO: Creating file: /nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/2016-05-15-13-43-12-680_2016-05-15-13-43-12-680_f in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>
> No logs were generated during those 43 minutes. During that time one CPU was
> maxed out, and I remember that no files were touched or created in the
> asterix folder. So it may not be a problem with the buffer cache in the
> recovery phase?
>
> On May 16, 2016, at 9:28 PM, Mike Carey <dtab...@gmail.com> wrote:
> >
> > Agreed and agreed. But is the spinning happening during recovery?
> >
> > (What's the role of the buffer cache during recovery?)
> >
> > On 5/17/16 2:10 AM, Jianfeng Jia wrote:
> >> I think the BufferCache is the core issue; the recovery process may just
> >> run into the same spin trap where it was stopped.
> >> And I created another issue proposing that we should be able to abort the
> >> task so that we don't need to restart the server.
> >>
> >>> On May 16, 2016, at 7:24 AM, Michael Blow <mblow.apa...@gmail.com> wrote:
> >>>
> >>> This might be related: (ASTERIXDB-1438) BufferCache spins indefinitely
> >>> when cache is exceeded.
> >>>
> >>> https://issues.apache.org/jira/browse/ASTERIXDB-1438
> >>>
> >>> Thanks,
> >>>
> >>> -MDB
> >>>
> >>> On Mon, May 16, 2016 at 1:52 AM Mike Carey <dtab...@gmail.com> wrote:
> >>>
> >>>> Glad it worked out - can someone also capture the core issue in JIRA? Thx!
> >>>> On May 15, 2016 11:40 PM, "Jianfeng Jia" <jianfeng....@gmail.com> wrote:
> >>>>
> >>>>> Great! The server is back now. Thanks a lot!
> >>>>>
> >>>>>> On May 15, 2016, at 2:26 PM, Murtadha Hubail <hubail...@gmail.com> wrote:
> >>>>>>
> >>>>>> You can delete the existing log files and create new empty ones with an
> >>>>>> incremented log file number, but it is very important that you don't
> >>>>>> delete the checkpoint file.
> >>>>>> Of course any data in the old log files will be lost, but the data
> >>>>>> already on disk will be available.
> >>>>>>
> >>>>>>> On May 15, 2016, at 1:23 PM, Jianfeng Jia <jianfeng....@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> We submitted a long-running join+insert query and stopped the cluster
> >>>>>>> to stop it. However, when the cluster restarted, it ran recovery
> >>>>>>> forever; the logs show that it keeps opening and creating a lot of
> >>>>>>> files in the buffer cache.
> >>>>>>>
> >>>>>>> In order to bring the cluster back so it can answer queries, are there
> >>>>>>> any hacks, such as removing the recovery txn logs? I'm worried that
> >>>>>>> this might ruin the cluster somehow.
> >>>>>>>
> >>>>>>> We are in a contest, so any quick help is really appreciated! Thanks!
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Jianfeng Jia
> >>>>>>> PhD Candidate of Computer Science
> >>>>>>> University of California, Irvine
> >>>>>>>
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>>
> >>>>> Jianfeng Jia
> >>>>> PhD Candidate of Computer Science
> >>>>> University of California, Irvine
> >>>>>
> >>
> >>
> >> Best,
> >>
> >> Jianfeng Jia
> >> PhD Candidate of Computer Science
> >> University of California, Irvine
> >>
> >
>
>
> Best,
>
> Jianfeng Jia
> PhD Candidate of Computer Science
> University of California, Irvine
>
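For anyone who hits this later, here is a rough sketch of the manual reset Murtadha described above: delete the old transaction log files, create one new empty log file with an incremented number, and leave the checkpoint file alone. The directory argument and the transaction_log_ naming pattern below are assumptions about the deployment rather than anything stated in this thread, so verify the actual file names in your txn log directory before trying it.

import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical sketch, not AsterixDB code: reset a transaction log directory by
// deleting the old numbered log files and creating a single new, empty one with
// an incremented number. Checkpoint files are deliberately left untouched.
public class TxnLogReset {

    // ASSUMPTION: log files are named like <prefix>0, <prefix>1, ... Adjust to
    // whatever naming your installation actually uses before running anything.
    private static final String LOG_PREFIX = "transaction_log_";

    public static void main(String[] args) throws IOException {
        Path logDir = Paths.get(args[0]); // the NC's transaction log directory
        long maxNumber = -1;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(logDir, LOG_PREFIX + "*")) {
            for (Path logFile : stream) {
                String suffix = logFile.getFileName().toString().substring(LOG_PREFIX.length());
                long n;
                try {
                    n = Long.parseLong(suffix);
                } catch (NumberFormatException e) {
                    continue; // not a numbered log file; leave it alone
                }
                maxNumber = Math.max(maxNumber, n);
                Files.delete(logFile); // the records in it are lost, as noted above
            }
        }
        // One new, empty log file with an incremented number; the checkpoint
        // file in the same directory is NOT touched.
        Path fresh = logDir.resolve(LOG_PREFIX + (maxNumber + 1));
        Files.createFile(fresh);
        System.out.println("Created empty log file: " + fresh);
    }
}

Again, treat this purely as the shape of the fix: stop the NC first, back up the directory, and double-check the checkpoint file name for your build before deleting anything.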