I think 3.11.3 has a bug which can cause OOMs on nodes running full repairs. Just check whether there is any correlation between the OOMs and the repair process.
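A quick way to eyeball that correlation is to pull repair-start lines out of system.log and oom-killer lines out of the kernel log, then compare timestamps. A toy sketch against sample lines (the paths and exact log wording are illustrative stand-ins; in practice the inputs would be the node's /var/log/cassandra/system.log and /var/log/messages):

```shell
# Sample stand-ins for the real logs (formats are illustrative;
# check the actual files on your node).
cat > /tmp/system.log <<'EOF'
INFO  [Repair-Task-1] 2019-01-23 19:55:02,118 RepairRunnable.java - Starting repair command #1
EOF
cat > /tmp/messages <<'EOF'
Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
EOF

# Repair-start events, then OOM-kill events; compare the timestamps by eye.
grep -o 'Starting repair command #[0-9]*' /tmp/system.log
grep -c 'oom-killer' /tmp/messages
```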
Thanks,
Sandeep

On Wed, 1 May 2019 at 11:02 PM, Mia <yeomii...@gmail.com> wrote:

> Hi Sandeep.
>
> I'm not running any manual repair, and I don't think there is a running full
> repair. I cannot see any log lines about repair in system.log these days.
> Does a full repair have anything to do with using a large amount of memory?
>
> Thanks.
>
> On 2019/05/01 10:47:50, Sandeep Nethi <nethisande...@gmail.com> wrote:
> > Are you by any chance running a full repair on these nodes?
> >
> > Thanks,
> > Sandeep
> >
> > On Wed, 1 May 2019 at 10:46 PM, Mia <yeomii...@gmail.com> wrote:
> >
> > > Hello, Ayub.
> > >
> > > I'm using Apache Cassandra, not the DSE edition, so I have never used
> > > the DSE Search feature.
> > > In my case, all the nodes of the cluster have the same problem.
> > >
> > > Thanks.
> > >
> > > On 2019/05/01 06:13:06, Ayub M <hia...@gmail.com> wrote:
> > > > Do you have Search on the same nodes, or is it only Cassandra? In my
> > > > case it was due to a memory leak bug in DSE Search that consumed more
> > > > memory, resulting in an OOM.
> > > >
> > > > On Tue, Apr 30, 2019, 2:58 AM yeomii...@gmail.com <yeomii...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I'm suffering from a similar problem with OSS Cassandra version 3.11.3.
> > > > > My Cassandra cluster has been running for more than a year, and there
> > > > > was no problem until this year.
> > > > > The cluster is write-intensive, consists of 70 nodes, and all rows
> > > > > have a 2 hr TTL.
> > > > > The only change was the read consistency, from QUORUM to ONE. (I
> > > > > cannot revert this change because of the read latency.)
> > > > > Below is my compaction strategy.
> > > > > ```
> > > > > compaction = {'class':
> > > > > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> > > > > 'compaction_window_size': '3', 'compaction_window_unit': 'MINUTES',
> > > > > 'enabled': 'true', 'max_threshold': '32', 'min_threshold': '4',
> > > > > 'tombstone_compaction_interval': '60', 'tombstone_threshold': '0.2',
> > > > > 'unchecked_tombstone_compaction': 'false'}
> > > > > ```
> > > > > I've tried rolling-restarting the cluster several times,
> > > > > but the memory usage of the Cassandra process always keeps going up.
> > > > > I also tried Native Memory Tracking, but it measured less memory
> > > > > usage than the system reports (RSS in /proc/{cassandra-pid}/status).
> > > > >
> > > > > Is there any way I could figure out the cause of this problem?
> > > > >
> > > > > On 2019/01/26 20:53:26, Jeff Jirsa <jji...@gmail.com> wrote:
> > > > > > You're running DSE, so the OSS list may not be much help. Datastax
> > > > > > may have more insight.
> > > > > >
> > > > > > In open source, the only things off-heap that vary significantly
> > > > > > are bloom filters and compression offsets - both scale with disk
> > > > > > space, and both increase during compaction. Large STCS compactions
> > > > > > can cause pretty meaningful allocations for these. Also, if you
> > > > > > have an unusually low compression chunk size or a very low bloom
> > > > > > filter FP ratio, those will be larger.
> > > > > >
> > > > > > --
> > > > > > Jeff Jirsa
> > > > > >
> > > > > > > On Jan 26, 2019, at 12:11 PM, Ayub M <hia...@gmail.com> wrote:
> > > > > > >
> > > > > > > A Cassandra node went down due to OOM, and checking
> > > > > > > /var/log/messages I see the below.
> > > > > > > ```
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: java cpuset=/ mems_allowed=0
> > > > > > > ....
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA: 1*4kB (U) 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15908kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 DMA32: 1294*4kB (UM) 932*8kB (UEM) 897*16kB (UEM) 483*32kB (UEM) 224*64kB (UEM) 114*128kB (UEM) 41*256kB (UEM) 12*512kB (UEM) 7*1024kB (UEM) 2*2048kB (EM) 35*4096kB (UM) = 242632kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 Normal: 5319*4kB (UE) 3233*8kB (UEM) 960*16kB (UE) 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 62500kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 38109 total pagecache pages
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages in swap cache
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Swap cache stats: add 0, delete 0, find 0/0
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Free swap  = 0kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Total swap = 0kB
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 16394647 pages RAM
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 0 pages HighMem/MovableOnly
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: 310559 pages reserved
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2634]     0  2634    41614      326      82        0             0 systemd-journal
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2690]     0  2690    29793      541      27        0             0 lvmetad
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 2710]     0  2710    11892      762      25        0         -1000 systemd-udevd
> > > > > > > .....
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [13774]     0 13774   459778    97729     429        0             0 Scan Factory
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14506]     0 14506    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14586]     0 14586    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14588]     0 14588    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14589]     0 14589    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14598]     0 14598    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14599]     0 14599    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14600]     0 14600    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [14601]     0 14601    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19679]     0 19679    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [19680]     0 19680    21628     5340      24        0             0 macompatsvc
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 9084]  1007  9084  2822449   260291     810        0             0 java
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 8509]  1007  8509 17223585 14908485   32510        0             0 java
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21877]     0 21877   461828    97716     318        0             0 ScanAction Mgr
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [21884]     0 21884   496653    98605     340        0             0 OAS Manager
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [31718]    89 31718    25474      486      48        0             0 pickup
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4891]  1007  4891    26999      191       9        0             0 iostat
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: [ 4957]  1007  4957    26999      192      10        0             0 iostat
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Out of memory: Kill process 8509 (java) score 928 or sacrifice child
> > > > > > > Jan 23 20:07:17 ip-xxx-xxx-xxx-xxx kernel: Killed process 8509 (java) total-vm:68894340kB, anon-rss:59496344kB, file-rss:137596kB, shmem-rss:0kB
> > > > > > > ```
> > > > > > >
> > > > > > > Nothing else runs on this host except DSE Cassandra (with Search)
> > > > > > > and monitoring agents. The max heap size is set to 31 GB, and the
> > > > > > > Cassandra Java process seems to be using ~57 GB (RAM is 62 GB) at
> > > > > > > the time of the error.
> > > > > > > So I am guessing the JVM started using lots of memory and
> > > > > > > triggered the OOM killer.
> > > > > > > Is my understanding correct - that this is a Linux-triggered JVM
> > > > > > > kill, because the JVM was consuming more than the available memory?
> > > > > > >
> > > > > > > So in this case the JVM was using a max of 31 GB of heap, and the
> > > > > > > remaining 26 GB it was using is non-heap memory. Normally this
> > > > > > > process takes around 42 GB, and the fact that at the moment of
> > > > > > > the OOM it was consuming 57 GB makes me suspect the Java process
> > > > > > > is the culprit rather than the victim.
> > > > > > >
> > > > > > > No heap dump was taken at the time of the issue; I have
> > > > > > > configured one now. But even if a heap dump had been taken, would
> > > > > > > it have helped figure out what is consuming more memory?
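As a sanity check on the numbers above: in the oom-killer process table, the `rss` column is counted in 4 KB pages, so process 8509's 14908485 pages works out to roughly the ~57 GB quoted, and the `anon-rss` in the kill line agrees. A quick sketch of the arithmetic:

```shell
# rss is in 4 KB pages; convert process 8509's table entry to GB.
awk 'BEGIN { printf "%.1f GB\n", 14908485 * 4 / 1024 / 1024 }'
# prints: 56.9 GB

# The kill line's anon-rss is already in kB; convert to GB.
awk 'BEGIN { printf "%.1f GB\n", 59496344 / 1024 / 1024 }'
# prints: 56.7 GB
```

Both figures line up with the ~57 GB of the 62 GB of RAM mentioned below, against a 31 GB max heap.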
> > > > > > > A heap dump would only cover the heap memory area; what should
> > > > > > > be used to dump the non-heap memory? Native Memory Tracking is
> > > > > > > one thing I came across.
> > > > > > > Is there any way to have the native memory dumped when an OOM occurs?
> > > > > > > What's the best way to monitor the JVM memory to diagnose OOM errors?
> > > > > >
> > > > > > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > > > > For additional commands, e-mail: user-h...@cassandra.apache.org
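For reference, the Native Memory Tracking mentioned in the thread is enabled with a HotSpot JVM flag and queried with jcmd; a sketch (the pid is a placeholder, and NMT itself adds roughly 5-10% overhead):

```shell
# In jvm.options (or wherever the node's JVM flags are set), add:
#   -XX:NativeMemoryTracking=summary
# Then, against the running Cassandra pid (placeholder here):
jcmd <cassandra-pid> VM.native_memory baseline
# ...wait while RSS grows, then diff against the baseline:
jcmd <cassandra-pid> VM.native_memory summary.diff
```

Note that NMT only tracks allocations made through the JVM's own allocator; memory malloc'd or mmap'd by native libraries, and glibc arena fragmentation, show up in RSS but not in NMT output, which could explain NMT reporting less than /proc/{pid}/status does.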