Hi Sandeep.

I'm not running any manual repair, and as far as I can tell no full repair is running either.
I don't see any repair-related messages in system.log these days.
Does a full repair have anything to do with using a large amount of memory?

Thanks.

On 2019/05/01 10:47:50, Sandeep Nethi <nethisande...@gmail.com> wrote: 
> Are you by any chance running the full repair on these nodes?
> 
> Thanks,
> Sandeep
> 
> On Wed, 1 May 2019 at 10:46 PM, Mia <yeomii...@gmail.com> wrote:
> 
> > Hello, Ayub.
> >
> > I'm using Apache Cassandra, not the DSE edition, so I have never used the
> > DSE Search feature.
> > In my case, all the nodes of the cluster have the same problem.
> >
> > Thanks.
> >
> > On 2019/05/01 06:13:06, Ayub M <hia...@gmail.com> wrote:
> > > Do you have search on the same nodes, or is it only Cassandra? In my case it
> > > was due to a memory leak bug in DSE Search that consumed more memory,
> > > resulting in OOM.
> > >
> > > On Tue, Apr 30, 2019, 2:58 AM yeomii...@gmail.com <yeomii...@gmail.com>
> > > wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm suffering from a similar problem with OSS Cassandra 3.11.3.
> > > > My Cassandra cluster has been running for more than a year, and there was
> > > > no problem until this year.
> > > > The cluster is write-intensive, consists of 70 nodes, and all rows have a
> > > > 2 hr TTL.
> > > > The only change is the read consistency, from QUORUM to ONE. (I cannot
> > > > revert this change because of the read latency.)
> > > > Below is my compaction strategy.
> > > > ```
> > > > compaction = {
> > > >   'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> > > >   'compaction_window_size': '3',
> > > >   'compaction_window_unit': 'MINUTES',
> > > >   'enabled': 'true',
> > > >   'max_threshold': '32',
> > > >   'min_threshold': '4',
> > > >   'tombstone_compaction_interval': '60',
> > > >   'tombstone_threshold': '0.2',
> > > >   'unchecked_tombstone_compaction': 'false'}
> > > > ```
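> > > > (For context, a quick bit of arithmetic on that config: with a 2 hr TTL and
> > > > 3-minute windows there are roughly 120 / 3 = 40 time windows of live data
> > > > per table at any given moment, before expired windows get dropped.)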
> > > > I've tried rolling-restarting the cluster several times,
> > > > but the memory usage of the Cassandra process always climbs back up.
> > > > I also tried Native Memory Tracking, but it reports less memory than the
> > > > system measures (RSS in /proc/{cassandra-pid}/status).
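> > > >
> > > > For reference, something along these lines (a rough sketch, assuming the
> > > > default local JMX endpoint on port 7199 with no auth; the class name is
> > > > just for illustration) reports what the JVM itself accounts for - heap,
> > > > non-heap, and the NIO "direct"/"mapped" buffer pools - so the gap between
> > > > that total and RSS is what still needs explaining:
> > > > ```
> > > > import java.lang.management.BufferPoolMXBean;
> > > > import java.lang.management.ManagementFactory;
> > > > import java.lang.management.MemoryMXBean;
> > > > import javax.management.MBeanServerConnection;
> > > > import javax.management.ObjectName;
> > > > import javax.management.remote.JMXConnector;
> > > > import javax.management.remote.JMXConnectorFactory;
> > > > import javax.management.remote.JMXServiceURL;
> > > >
> > > > public class JvmMemorySnapshot {
> > > >     public static void main(String[] args) throws Exception {
> > > >         // Cassandra's default JMX endpoint; adjust host/port for your node.
> > > >         JMXServiceURL url = new JMXServiceURL(
> > > >                 "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
> > > >         try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
> > > >             MBeanServerConnection mbs = jmx.getMBeanServerConnection();
> > > >             MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
> > > >                     mbs, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
> > > >             System.out.printf("heap used:     %6d MB%n",
> > > >                     mem.getHeapMemoryUsage().getUsed() >> 20);
> > > >             System.out.printf("non-heap used: %6d MB%n",
> > > >                     mem.getNonHeapMemoryUsage().getUsed() >> 20);
> > > >             // "direct" covers NIO direct buffers, "mapped" covers mmapped files.
> > > >             for (ObjectName name : mbs.queryNames(
> > > >                     new ObjectName("java.nio:type=BufferPool,name=*"), null)) {
> > > >                 BufferPoolMXBean pool = ManagementFactory.newPlatformMXBeanProxy(
> > > >                         mbs, name.toString(), BufferPoolMXBean.class);
> > > >                 System.out.printf("buffer pool %-8s %6d MB%n",
> > > >                         pool.getName(), pool.getMemoryUsed() >> 20);
> > > >             }
> > > >         }
> > > >     }
> > > > }
> > > > ```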
> > > >
> > > > Is there any way that I could figure out the cause of this problem?
> > > >
> > > >
> > > > On 2019/01/26 20:53:26, Jeff Jirsa <jji...@gmail.com> wrote:
> > > > > You're running DSE, so the OSS list may not be much help. DataStax may
> > > > > have more insight.
> > > > >
> > > > > In open source, the only things off-heap that vary significantly are
> > > > > bloom filters and compression offsets - both scale with disk space, and
> > > > > both increase during compaction. Large STCS compactions can cause pretty
> > > > > meaningful allocations for these. Also, if you have an unusually low
> > > > > compression chunk size or a very low bloom filter FP ratio, those will
> > > > > be larger.
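> > > > >
> > > > > (If you want to see how much those structures actually hold on a node,
> > > > > the per-table gauges are exposed over JMX - nodetool tablestats prints
> > > > > the same numbers - and a small client along these lines can sum them.
> > > > > This is only a sketch: it assumes the stock table metric names and the
> > > > > default JMX port 7199; the class name is made up for the example.)
> > > > > ```
> > > > > import javax.management.MBeanServerConnection;
> > > > > import javax.management.ObjectName;
> > > > > import javax.management.remote.JMXConnector;
> > > > > import javax.management.remote.JMXConnectorFactory;
> > > > > import javax.management.remote.JMXServiceURL;
> > > > >
> > > > > public class OffHeapByTable {
> > > > >     public static void main(String[] args) throws Exception {
> > > > >         JMXServiceURL url = new JMXServiceURL(
> > > > >                 "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
> > > > >         try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
> > > > >             MBeanServerConnection mbs = jmx.getMBeanServerConnection();
> > > > >             // Sum each off-heap gauge across all tables on this node.
> > > > >             String[] gauges = {
> > > > >                     "BloomFilterOffHeapMemoryUsed",
> > > > >                     "IndexSummaryOffHeapMemoryUsed",
> > > > >                     "CompressionMetadataOffHeapMemoryUsed",
> > > > >                     "MemtableOffHeapSize"};
> > > > >             for (String gauge : gauges) {
> > > > >                 long total = 0;
> > > > >                 ObjectName pattern = new ObjectName(
> > > > >                         "org.apache.cassandra.metrics:type=Table,name=" + gauge + ",*");
> > > > >                 for (ObjectName name : mbs.queryNames(pattern, null)) {
> > > > >                     total += ((Number) mbs.getAttribute(name, "Value")).longValue();
> > > > >                 }
> > > > >                 System.out.printf("%-40s %8d MB%n", gauge, total >> 20);
> > > > >             }
> > > > >         }
> > > > >     }
> > > > > }
> > > > > ```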
> > > > >
> > > > >
> > > > > --
> > > > > Jeff Jirsa
> > > > >
> > > > >
> > > > > > On Jan 26, 2019, at 12:11 PM, Ayub M <hia...@gmail.com> wrote:
> > > > > >
> > > > > > Cassandra node went down due to OOM, and checking /var/log/messages I
> > > > > > see the below.
> > > > > > ```
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
> > > > > > ....
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap  = 0kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634]     0  2634    41614      326      82        0             0 systemd-journal
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690]     0  2690    29793      541      27        0             0 lvmetad
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710]     0  2710    11892      762      25        0         -1000 systemd-udevd
> > > > > > .....
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774]     0 13774   459778    97729     429        0             0 Scan Factory
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506]     0 14506    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586]     0 14586    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588]     0 14588    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589]     0 14589    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598]     0 14598    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599]     0 14599    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600]     0 14600    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601]     0 14601    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679]     0 19679    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680]     0 19680    21628     5340      24        0             0 macompatsvc
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084]  1007  9084  2822449   260291     810        0             0 java
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509]  1007  8509 17223585 14908485   32510        0             0 java
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877]     0 21877   461828    97716     318        0             0 ScanAction Mgr
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884]     0 21884   496653    98605     340        0             0 OAS Manager
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718]    89 31718    25474      486      48        0             0 pickup
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891]  1007  4891    26999      191       9        0             0 iostat
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957]  1007  4957    26999      192      10        0             0 iostat
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
> > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
> > > > > > ```
> > > > > >
> > > > > > Nothing else runs on this host except DSE Cassandra with search and
> > > > > > monitoring agents. Max heap size is set to 31g; the Cassandra java
> > > > > > process seems to be using ~57gb (RAM is 62gb) at the time of the error.
> > > > > > So I am guessing the JVM started using lots of memory and triggered the
> > > > > > OOM error. Is my understanding correct - that this is a Linux-triggered
> > > > > > kill of the JVM because the JVM was consuming more than the available
> > > > > > memory?
> > > > > >
> > > > > > So in this case the JVM was using at most 31g of heap, and the remaining
> > > > > > 26gb it was using is non-heap memory. Normally this process takes around
> > > > > > 42g, and the fact that it was consuming 57g at the moment of the OOM
> > > > > > makes me suspect the java process is the culprit rather than the victim.
> > > > > >
> > > > > > At the time of the issue no heap dump was taken; I have configured that
> > > > > > now. But even if a heap dump had been taken, would it have helped figure
> > > > > > out what is consuming more memory? A heap dump only covers the heap
> > > > > > area - what should be used to dump the non-heap side? Native Memory
> > > > > > Tracking is one thing I came across.
> > > > > > Is there any way to have native memory dumped when the OOM occurs?
> > > > > > What's the best way to monitor the JVM memory to diagnose OOM errors?
> > > > > >

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org
