I’m trying to reproduce the issue and will dump it then.

> On May 16, 2016, at 10:57 PM, Michael Blow <mblow.apa...@gmail.com> wrote:
>
> It would be good to get thread dumps if this happens again.
>
> On Mon, May 16, 2016 at 10:56 PM Jianfeng Jia <jianfeng....@gmail.com> wrote:
>
>> I revisited the logs, and luckily they haven’t been cleared yet. Here is
>> part of nc1’s log:
>>
>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>> INFO: Opening file: 14 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>> INFO: Opening file: 13 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>> INFO: Creating file: /nc1/iodevice1/storage/partition_0/hackathon/log_device_idx_log_device/2016-05-15-12-56-48-712_2016-05-15-12-23-31-225_f in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>> INFO: Opening file: 15 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>> ——————————————————————————————————————
>> /// I shut down the cluster at this point and started the server right away.
>> ——————————————————————————————————————
>> May 15, 2016 1:43:12 PM org.apache.asterix.transaction.management.service.recovery.RecoveryManager startRecoveryRedoPhase
>> INFO: Logs REDO phase completed. Redo logs count: 1197
>> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness flush
>> INFO: Started a flush operation for index: LSMBTree [/nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/] ...
>> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>> INFO: Creating file: /nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/2016-05-15-13-43-12-680_2016-05-15-13-43-12-680_f in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>
>> No log entries were generated during those 43 minutes. During that time
>> one CPU was fully busy, and I remember that no file was touched or
>> generated in the asterix folder. So maybe the problem is not the buffer
>> cache in the recovery phase?
>>
>>> On May 16, 2016, at 9:28 PM, Mike Carey <dtab...@gmail.com> wrote:
>>>
>>> Agreed and agreed. But is the spinning on recovery?
>>>
>>> (What's the role of the buffer cache during recovery?)
>>>
>>> On 5/17/16 2:10 AM, Jianfeng Jia wrote:
>>>> I think the BufferCache is the core issue; the recovery process may
>>>> just run into the same spin trap where it was stopped.
>>>> I also created another issue: we should be able to abort the task so
>>>> that we don't need to restart the server.
>>>>
>>>>> On May 16, 2016, at 7:24 AM, Michael Blow <mblow.apa...@gmail.com> wrote:
>>>>>
>>>>> This might be related: (ASTERIXDB-1438) BufferCache spins indefinitely
>>>>> when cache is exceeded.
>>>>>
>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1438
>>>>>
>>>>> Thanks,
>>>>>
>>>>> -MDB
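For context on ASTERIXDB-1438: the reported failure mode is a page-pin loop in
the buffer cache that retries forever once every page is pinned and no victim
can be evicted. A toy sketch of that pattern follows; the names and sizes are
made up for illustration, and this is not the actual Hyracks BufferCache code.

    import java.util.concurrent.atomic.AtomicInteger;

    // Toy illustration of the "spin trap" pattern from ASTERIXDB-1438.
    // NOT the Hyracks BufferCache; class and method names are hypothetical.
    final class ToyBufferCache {
        private final AtomicInteger freePages;

        ToyBufferCache(int numPages) {
            freePages = new AtomicInteger(numPages);
        }

        // Pin a page, retrying until one is free. With no timeout and no
        // abort check, this loop spins forever once the cache is exhausted:
        // one CPU stays busy while no file is read or written, which matches
        // the symptom reported earlier in the thread.
        byte[] pin() {
            while (true) {
                int free = freePages.get();
                if (free > 0 && freePages.compareAndSet(free, free - 1)) {
                    return new byte[4096]; // the granted "page"
                }
                Thread.yield(); // spin; if nothing ever unpins, we never exit
            }
        }

        void unpin() {
            freePages.incrementAndGet();
        }
    }

A bounded retry count or an interruptible wait inside pin() would turn this
hang into a failed (and abortable) operation, which is what the abort-task
issue mentioned above is asking for.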
>>>>> On Mon, May 16, 2016 at 1:52 AM Mike Carey <dtab...@gmail.com> wrote:
>>>>>
>>>>>> Glad it worked out - can someone also capture the core issue in JIRA? Thx!
>>>>>>
>>>>>> On May 15, 2016 11:40 PM, "Jianfeng Jia" <jianfeng....@gmail.com> wrote:
>>>>>>
>>>>>>> Great! The server is back now. Thanks a lot!
>>>>>>>
>>>>>>>> On May 15, 2016, at 2:26 PM, Murtadha Hubail <hubail...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> You can delete the existing log files and create new empty ones with
>>>>>>>> an incremented log file number, but it is very important that you
>>>>>>>> don't delete the checkpoint file.
>>>>>>>> Of course, any data in the old log files will be lost, but the data
>>>>>>>> already on disk will be available.
>>>>>>>>
>>>>>>>>> On May 15, 2016, at 1:23 PM, Jianfeng Jia <jianfeng....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> We submitted a long-running join+insert query and stopped the
>>>>>>>>> cluster to stop it from running. However, when it restarted, it ran
>>>>>>>>> recovery forever; the logs show a lot of buffer cache file activity.
>>>>>>>>>
>>>>>>>>> In order to bring the cluster back to answer the query, are there
>>>>>>>>> any hacky solutions, such as removing the recovery txn logs? I'm
>>>>>>>>> worried that it would ruin the cluster somehow.
>>>>>>>>>
>>>>>>>>> We are in a contest, so any early help is really appreciated! Thanks!
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>>
>>>>>>>>> Jianfeng Jia
>>>>>>>>> PhD Candidate of Computer Science
>>>>>>>>> University of California, Irvine
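To make Murtadha's recipe concrete, here is a rough sketch of the reset. The
log directory path and the "transaction_log_<n>" naming below are assumptions
for illustration, not necessarily AsterixDB's actual layout; check your txn
log configuration and the real file names first, and keep a backup of the
directory before touching anything.

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Sketch of the manual txn-log reset described above. Never touch the
    // checkpoint file; deleting the logs loses their records, but data
    // already flushed to disk survives.
    public class TxnLogReset {
        public static void main(String[] args) throws IOException {
            Path logDir = Paths.get("/nc1/txnLogDir"); // hypothetical path
            String prefix = "transaction_log_";        // assumed naming
            long maxLogNumber = -1;

            try (DirectoryStream<Path> files = Files.newDirectoryStream(logDir)) {
                for (Path p : files) {
                    String name = p.getFileName().toString();
                    if (name.startsWith("checkpoint")) {
                        continue; // very important: keep the checkpoint file
                    }
                    if (name.startsWith(prefix)) {
                        maxLogNumber = Math.max(maxLogNumber,
                                Long.parseLong(name.substring(prefix.length())));
                        Files.delete(p);
                    }
                }
            }
            // One new, empty log file with an incremented log file number.
            Files.createFile(logDir.resolve(prefix + (maxLogNumber + 1)));
        }
    }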
Best,

Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine
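P.S. On the thread-dump suggestion: jstack <pid> (or kill -3 <pid>) is the
quickest way to get one. If it is easier to trigger the dump from inside the
process, the standard JDK management API works too; a minimal sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumper {
        public static void main(String[] args) {
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();
            // (true, true) includes locked monitors and ownable synchronizers,
            // which helps when hunting a spin or a deadlock.
            for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                System.out.print(info); // note: ThreadInfo.toString() may
                                        // truncate deep stacks; jstack
                                        // prints them in full
            }
        }
    }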