[ https://issues.apache.org/jira/browse/CASSANDRA-8295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14207203#comment-14207203 ]
Jose Martinez Poblete commented on CASSANDRA-8295:
--------------------------------------------------

More info from MAT
{noformat}
Class Name | Objects | Shallow Heap
java.nio.HeapByteBuffer | 73,845,620 | 3,544,589,760
edu.stanford.ppl.concurrent.SnapTreeMap$Node | 34,614,044 | 1,661,474,112
byte[] | 3,969,475 | 1,510,362,528
org.apache.cassandra.db.Column | 34,614,043 | 1,107,649,376
edu.stanford.ppl.concurrent.CopyOnWriteManager$COWEpoch | 411,924 | 39,544,704
java.nio.ByteBuffer[] | 823,848 | 30,913,568
long[] | 411,924 | 22,819,304
edu.stanford.ppl.concurrent.SnapTreeMap$RootHolder | 411,924 | 19,772,352
org.apache.cassandra.db.RangeTombstoneList | 411,924 | 16,476,960
int[] | 411,924 | 15,456,784
edu.stanford.ppl.concurrent.CopyOnWriteManager$Latch | 411,924 | 13,181,568
edu.stanford.ppl.concurrent.SnapTreeMap | 411,924 | 13,181,568
java.util.concurrent.atomic.AtomicReference | 823,848 | 13,181,568
java.util.concurrent.ConcurrentSkipListMap$Node | 411,929 | 9,886,296
org.apache.cassandra.db.DecoratedKey | 411,928 | 9,886,272
java.lang.Long | 411,928 | 9,886,272
org.apache.cassandra.db.AtomicSortedColumns | 411,924 | 9,886,176
org.apache.cassandra.db.AtomicSortedColumns$Holder | 411,924 | 9,886,176
org.apache.cassandra.db.DeletionInfo | 411,924 | 9,886,176
org.apache.cassandra.dht.LongToken | 411,928 | 6,590,848
edu.stanford.ppl.concurrent.SnapTreeMap$COWMgr | 411,924 | 6,590,784
java.util.concurrent.ConcurrentSkipListMap$Index | 207,065 | 4,969,560
java.util.concurrent.ConcurrentSkipListMap$HeadIndex | 16 | 512
org.apache.cassandra.db.DeletedColumn | 1 | 32
Total: 24 entries | 155,076,837 | 8,086,073,256
{noformat}

> Cassandra runs OOM @ java.util.concurrent.ConcurrentSkipListMap$HeadIndex
> -------------------------------------------------------------------------
>
> Key: CASSANDRA-8295
> URL: https://issues.apache.org/jira/browse/CASSANDRA-8295
> Project: Cassandra
> Issue Type: Bug
> Components: Core
> Environment: DSE 4.5.3 Cassandra 2.0.11.82
> Reporter: Jose Martinez Poblete
> Attachments: alln01-ats-cas3.cassandra.yaml, output.tgz, system.tgz, system.tgz.1, system.tgz.2, system.tgz.3
>
>
> Customer runs a 3-node cluster.
> Their dataset is less than 1 TB and, during data load, one of the nodes enters a GC death spiral:
> {noformat}
> INFO [ScheduledTasks:1] 2014-11-07 23:31:08,094 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 3348 ms for 2 collections, 1658268944 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:40:58,486 GCInspector.java (line 116) GC for ParNew: 442 ms for 2 collections, 6079570032 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:40:58,487 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 7351 ms for 2 collections, 6084678280 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:41:01,836 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 603 ms for 1 collections, 7132546096 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:41:09,626 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 761 ms for 1 collections, 7286946984 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:41:15,265 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 703 ms for 1 collections, 7251213520 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:41:25,027 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 1205 ms for 1 collections,
> 6507586104 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:41:41,374 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 13835 ms for 3 collections, 6514187192 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-07 23:41:54,137 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 6834 ms for 2 collections, 6521656200 used; max is 8375238656
> ...
> INFO [ScheduledTasks:1] 2014-11-08 12:13:11,086 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 43967 ms for 2 collections, 8368777672 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-08 12:14:14,151 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 63968 ms for 3 collections, 8369623824 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-08 12:14:55,643 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 41307 ms for 2 collections, 8370115376 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-08 12:20:06,197 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 309634 ms for 15 collections, 8374994928 used; max is 8375238656
> INFO [ScheduledTasks:1] 2014-11-08 13:07:33,617 GCInspector.java (line 116) GC for ConcurrentMarkSweep: 2681100 ms for 143 collections, 8347631560 used; max is 8375238656
> {noformat}
> Their application waits 1 minute before a retry when a timeout is returned
> This is what we find on their heapdumps:
> {noformat}
> Class Name | Shallow Heap | Retained Heap | Percentage
> ----------------------------------------------------------------------------------------------------
> org.apache.cassandra.db.Memtable @ 0x773f52f80 | 72 | 8,086,073,504 | 96.66%
> |- java.util.concurrent.ConcurrentSkipListMap @ 0x724508fe8 | 48 | 8,086,073,320 | 96.66%
> | |- java.util.concurrent.ConcurrentSkipListMap$HeadIndex @ 0x64f9219a0 | 32 | 8,086,073,256 | 96.66%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x614b081a8 | 24 | 16,230,976 | 0.19%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x7da171948 | 24 | 4,922,288 | 0.06%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x7f4518a80 | 24 | 4,405,496 | 0.05%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x611d69d10 | 24 | 3,737,672 | 0.04%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x71cd2fae8 | 24 | 2,921,048 | 0.03%
> | | |- java.util.concurrent.ConcurrentSkipListMap$HeadIndex @ 0x728faed50 | 32 | 2,012,592 | 0.02%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x6387eb950 | 24 | 1,641,696 | 0.02%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x727f474f0 | 24 | 1,328,936 | 0.02%
> | | |- java.util.concurrent.ConcurrentSkipListMap$Node @ 0x70d7a02b0 | 24 | 1,050,624 | 0.01%
> | | |- byte[1048576] @ 0x7d87873d8 .........8.........CS.l`...attributes...slot..............attributes...runtime......A..<x.........C.......attributes...procgid.87.....CS.`....attributes...bflush.00.....CV......attributes...username........uV....server.f1432541.........8...server......A..<...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x60ab7b920 .....7...p...attributes...tottime....../..%....area......0.......attributes...lineid.56.....7.i.....attributes...tottime.156258924.....0B)\....container.4...../.......server....,PTXCALsdihqprod1\sdihqprod1...../.......machine.fxcdom1.....7.i.....attributes...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x609fb54f8 .....E.......attributes...lineid.901137423.....E.......attributes...testr1.1413.....E.......attributes...testr2.M393B1K70QB0-YK02014-01-03 06:46:31.....E.......attributes...tenum1name.EFSTLOOP.....CV......attributes...numunits.1.....E.......attributes...pa...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x60a0b5508 .....E.z.....area......?.......attributes...labelnum.SYSFA.....0"U.....attributes...testr1name.D75165799...../..^....attributes...crc.Hexload_Bootloader.....E.TR....machine......0.......attributes...bflush....../..&....attributes...majline....../._.P...att...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7d8f5e2b8 ......B9.....machine.solfr5.......L.....attributes...runtime.146.............attributes...tottime.109.......t.h...attributes...bmap.0......B9.....uuttype.VIP2-40=.......L.....attributes...cpptimeid.2006-04-11 10:53:48.............attributes...partnum.73-91...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7d905e2c8 .....E.|.x...attributes...runtime.310.....E.P.0...attributes...partnum.15-13637-02.....E./<....area.SYSFA.....E./<....passfail.S.....E.|.x...attributes...testr1.1413.....E.P.0...attributes...partnum2.15-13637-02.....E./<....container....T.....E./<....attri...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7d915e2d8 ...../..l........../..l....server.sdihqprod1\sdihqprod1...../..l....machine.fxcdom1...../..l....uuttype.73-12304-03...../..l....area.PASTE...../..l....passfail.P...../..l....container........../..l....attributes...majline.0...../..l....attributes...subslot...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7d925e2e8 .............attributes...tottime.42........KH...attributes...testtime.........3....uuttype.0.............attributes...runtime..............attributes...procgid.73-9341-021417817........3....area.PASTE.............attributes...test.PASSED........3....passf...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7be7473f0 .....=..Jh.........=...(...attributes...runtime.f6f1298f-830f-47f4-b1dd-1adb07b99ff9653.....=..*(...attributes...numunits.1.....=..Jh...server......=.._....attributes...testr3name.RCDN9HQPROD1\RCDN9HQPROD1CPPVersion:3.6.2803.0.....=..*(...attributes...test...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7be847800 .........0...passfail.P........ (...attributes...bflush.0.......w.....attributes...tottime.1161.....A.b.(...area.SYSVF.......kH....attributes...test..............attributes...runtime.PASSED.....A.zl....attributes...procgid.2.........0...container.............| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7be949070 .....=...0...attributes...pcid......=..>....attributes...cpptimeid.6cc40f78-9525-4488-909f-2247d9537cf82013-04-04 19:24:23.....=.Z.....attributes...runtime.0.....=...0...attributes...testr3name.CPPVersion:3.6.2803.0.....=.............=...0...attributes...p...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7bea4a8e0 .....>{A0....uuttype.FJZPROD1\FJZPROD1.....>z.Mp...attributes...pcid......Ct..(...attributes...lineid......=..n8...container......=.oE..........4B......machine.F2049802CBLSTB-4044066-K9fxhmcekit2.....=.p(h...attributes...proctime......>z..`...machine.........| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7beb4a8f0 .....A../....attributes...partnum2......B'......attributes...username....D.....B'.L....area.f1303257.....A...P...area.74-8071-01F118190965553.....A.......server.PCBDLSYSPM.......$.....attributes...runtime......>{r+....attributes...cpptimeid......B.B. ...co...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7bec4a900 .....=..;x...attributes...username.tczpawe73-100074-01.....=..;x...attributes...slot.0.....=.......area.ASSY.....=..;x...attributes...lineid......=.......passfail.0P.....=.......container..........=..;x...attributes...numunits.1.....=.......attributes...pa...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x7bed55ff0 .....=oC/....machine......"..Q....uuttype......"...`...attributes...parentsernum.73-8479-02FCZ133171DPfxcestgfqa1....."`..x...attributes...test......=l.2p...attributes...tenum3.8242009070919300730FOC13283D6A....."..Q....area......"...(...attributes...bflus...| 1,048,592 | 1,048,592 | 0.01%
> | | |- byte[1048576] @ 0x61cf45088 .....CSaL....server.FXCPROD1\FXCPROD1.....CW......passfail.P.....CSr.....attributes...runtime.50.....CSaL....machine.foxchict217.....CW......container..........CSaL....uuttype......CW......attributes...username.73-13315-03xzhang.....CSr.....attributes...te...| 1,048,592 | 1,048,592 | 0.01%
> | | '- Total: 25 of 166,289 entries; 166,264 more | | |
> | |- java.util.concurrent.ConcurrentSkipListMap$EntrySet @ 0x72541dc58 | 16 | 16 | 0.00%
> | '- Total: 2 entries | | |
> |- org.github.jamm.MemoryMeter @ 0x72541db50 | 24 | 40 | 0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db68 | 24 | 24 | 0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db80 | 24 | 24 | 0.00%
> |- java.util.concurrent.atomic.AtomicLong @ 0x72541db38 | 24 | 24 | 0.00%
> '- Total: 5 entries | | |
> ----------------------------------------------------------------------------------------------------
> {noformat}
> They are using the defaults at
> cassandra.yaml which means sstables should not use that much heap. Setting the following has been of no use:
> {noformat}
> memtable_total_space_in_mb: 2000
> memtable_flush_queue_size: 1
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
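
For anyone who wants to quantify the GC log excerpt quoted above, here is a minimal sketch that pulls the pause time and heap occupancy out of GCInspector lines. It is not part of the ticket. The regular expression is inferred only from the lines quoted here, so other Cassandra versions may log a different shape. The helper names (`parse_gc_line`, `looks_saturated`) and the 95% threshold are hypothetical choices for illustration, not Cassandra settings.

```python
import re

# Assumed shape of a GCInspector line, inferred only from the samples quoted
# in this ticket; other Cassandra versions may format these lines differently.
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<pause_ms>\d+) ms for "
    r"(?P<collections>\d+) collections, (?P<used>\d+) used; "
    r"max is (?P<max_heap>\d+)"
)


def parse_gc_line(line):
    """Return the fields of one GCInspector line as a dict, or None if no match."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    fields = {
        key: int(value)
        for key, value in match.groupdict().items()
        if key != "collector"
    }
    fields["collector"] = match.group("collector")
    # Fraction of the maximum heap reported as still in use by this log line.
    fields["occupancy"] = fields["used"] / fields["max_heap"]
    return fields


def looks_saturated(entry, threshold=0.95):
    """Hypothetical heuristic: heap is still nearly full in a CMS report."""
    return (
        entry["collector"] == "ConcurrentMarkSweep"
        and entry["occupancy"] >= threshold
    )


# Two lines copied from the ticket, unwrapped onto single lines.
SAMPLE_LINES = [
    "INFO [ScheduledTasks:1] 2014-11-07 23:31:08,094 GCInspector.java (line 116) "
    "GC for ConcurrentMarkSweep: 3348 ms for 2 collections, 1658268944 used; "
    "max is 8375238656",
    "INFO [ScheduledTasks:1] 2014-11-08 12:20:06,197 GCInspector.java (line 116) "
    "GC for ConcurrentMarkSweep: 309634 ms for 15 collections, 8374994928 used; "
    "max is 8375238656",
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        entry = parse_gc_line(line)
        print(
            entry["collector"],
            entry["pause_ms"],
            f"{entry['occupancy']:.1%}",
            looks_saturated(entry),
        )
```

Run over a full system.log, the same parser would show occupancy staying close to the maximum while ConcurrentMarkSweep pause times grow, which is the pattern the description calls a GC death spiral.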