On Thu, Aug 19, 2010 at 2:48 PM, Wayne <wav...@gmail.com> wrote:
> I am having some serious problems keeping a 6 node cluster up and running
> and stable under load. Any help would be greatly appreciated.
>
> Basically it always comes back to OOM errors that never seem to subside.
> After 5 minutes or 3 hours of heavy load depending on settings one or more
> nodes seem to go down with an OOM error. Upon restart the node tries to get
> going but ends up with an OOM error again and again and again. This patterns
> repeats and repeats with no way out. Once a node goes south on me deleting
> everything and starting from scratch is my only option. I can never get it
> to stay up again.
>
> I will try to describe the latest in detail. I am running a migration from a
> MySQL database with 24 processes against 6 nodes in python using the std
> thrift client. Should be a piece of cake in cassandra but no matter what I
> have tried it always seems to kill a node after not too long.
>
> 6 x Nodes with 8 cores, 8G heap, 24g physical
> 0.6.4
> JVM Options for lower compaction priority
> JVM 1.6.0_21-b06
>
> After 3 hours one node hard crashed with an OOM error. See below for
> messages.
>
> I restart and then see nothing but MessageDeserializationTask & some
> GCInspector messages.
>
>  INFO [GC inspection] 2010-08-19 17:06:09,228 GCInspector.java (line 143)
> MEMTABLE-POST-FLUSHER             1         1
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 17:06:09,227
> MessageDeserializationTask.java (line 47) dropping message (52,288ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 17:06:09,229 GCInspector.java (line 143)
> AE-SERVICE-STAGE                  0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 17:06:09,229
> MessageDeserializationTask.java (line 47) dropping message (52,290ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 17:06:09,229 GCInspector.java (line 143)
> FLUSH-WRITER-POOL                 1         1
>
>
> After the 2nd restart it dies/freezes completely without actually crashing
> (kill -9 is required). I knew it was dead because nothing was logged for 30
> minutes to the log and nodetool still registered the node as down.
>
> After the 3rd reboot it comes back UP but then eventually crashes again. See
> below for details.
>
> I have tried using key or row cache but cache makes the problem happen in 5
> minutes as there is less memory and the OOM happens almost immediately. I
> have played with setting less concurrent readers and writers and now have it
> set to 16/16 (read/write). I have tried to set the mmap_index_only with no
> change as the issue is really a JVM issue. All other settings are
> std/default.
>
> Does anyone keep their nodes up under load over time? I have been working
> with cassandra for a while now and still have yet to keep anything up under
> load for very long. I know nothing about java, and frankly am starting to
> wonder if I need to be a java programmer to use use cassandra/run in debug
> mode. Any help would be greatly appreciated. These issues in the past made
> me try to use hbase which was solid as a rock (comparatively) but SLOW and
> too complicated...I came back again thinking .6.4 and soon .7 would be
> better but it almost seems worse to me in terms of stability. Frankly next I
> will have to look at other alternatives because cassandra is totally
> unstable. I do not want to give up after all this time but I am out of
> ideas.
>
> HELP!
>
> Messages prior to first crash:
>
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:33:19,985
> MessageDeserializationTask.java (line 47) dropping message (12,024ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 16:33:19,985 GCInspector.java (line 143)
> CONSISTENCY-MANAGER               0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,986 GCInspector.java (line 143)
> LB-TARGET                         0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,986 GCInspector.java (line 143)
> ROW-MUTATION-STAGE                1       167
>  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> MESSAGE-STREAMING-POOL            0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> LOAD-BALANCER-STAGE               0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> FLUSH-SORTER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> MEMTABLE-POST-FLUSHER             0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,987 GCInspector.java (line 143)
> AE-SERVICE-STAGE                  0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,988 GCInspector.java (line 143)
> FLUSH-WRITER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:33:19,988 GCInspector.java (line 143)
> HINTED-HANDOFF-POOL               1         1
>  INFO [GC inspection] 2010-08-19 16:33:19,988 GCInspector.java (line 148)
> CompactionManager               n/a         0
>  INFO [GMFD:1] 2010-08-19 16:33:19,989 Gossiper.java (line 578) InetAddress
> /10.4.0.202 is now UP
>  INFO [GMFD:1] 2010-08-19 16:33:19,989 Gossiper.java (line 578) InetAddress
> /10.4.0.203 is now UP
>  INFO [GMFD:1] 2010-08-19 16:33:19,990 Gossiper.java (line 578) InetAddress
> /10.4.0.201 is now UP
>  INFO [GMFD:1] 2010-08-19 16:33:19,990 Gossiper.java (line 578) InetAddress
> /10.4.0.204 is now UP
>  INFO [GMFD:1] 2010-08-19 16:33:19,990 Gossiper.java (line 578) InetAddress
> /10.4.0.205 is now UP
>  INFO [ROW-MUTATION-STAGE:19] 2010-08-19 16:33:21,254 ColumnFamilyStore.java
> (line 357) Point.Time.Data.Value has reached its threshold; switching in a
> fresh Memtable at
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282235556898.log',
> position=15371920)
>  INFO [ROW-MUTATION-STAGE:19] 2010-08-19 16:33:21,254 ColumnFamilyStore.java
> (line 609) Enqueuing flush of
> memtable-point.time.data.va...@951707146(3462840 bytes, 314574 operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:21,254 Memtable.java (line 148)
> Writing memtable-point.time.data.va...@951707146(3462840 bytes, 314574
> operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:22,090 Memtable.java (line 162)
> Completed flushing
> /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-362-Data.db
>  INFO [ROW-MUTATION-STAGE:58] 2010-08-19 16:33:23,289 ColumnFamilyStore.java
> (line 357) Point.Data.Time.Value has reached its threshold; switching in a
> fresh Memtable at
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282235556898.log',
> position=25251401)
>  INFO [ROW-MUTATION-STAGE:58] 2010-08-19 16:33:23,289 ColumnFamilyStore.java
> (line 609) Enqueuing flush of
> memtable-point.data.time.va...@142023252(6330574 bytes, 314746 operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:23,289 Memtable.java (line 148)
> Writing memtable-point.data.time.va...@142023252(6330574 bytes, 314746
> operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:33:23,785 Memtable.java (line 162)
> Completed flushing
> /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1030-Data.db
>  INFO [COMPACTION-POOL:1] 2010-08-19 16:33:23,787 CompactionManager.java
> (line 246) Compacting
> [org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1027-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1028-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1029-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1030-Data.db')]
>  INFO [GC inspection] 2010-08-19 16:34:03,565 GCInspector.java (line 116) GC
> for ConcurrentMarkSweep: 39651 ms, 414341376 reclaimed leaving 8139886200
> used; max is 8700035072
>  INFO [GC inspection] 2010-08-19 16:34:03,566 GCInspector.java (line 137)
> Pool Name                    Active   Pending
>  INFO [GC inspection] 2010-08-19 16:34:03,566 GCInspector.java (line 143)
> STREAM-STAGE                      0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,567 GCInspector.java (line 143)
> FILEUTILS-DELETE-POOL             0         0
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,567
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-358-Data.db
>  INFO [GC inspection] 2010-08-19 16:34:03,567 GCInspector.java (line 143)
> RESPONSE-STAGE                    0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,584 GCInspector.java (line 143)
> ROW-READ-STAGE                    0         0
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,584
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/system/HintsColumnFamily-7-Data.db
>  INFO [GC inspection] 2010-08-19 16:34:03,584 GCInspector.java (line 143)
> LB-OPERATIONS                     0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,585 GCInspector.java (line 143)
> GMFD                              0         1
>  INFO [GC inspection] 2010-08-19 16:34:03,585 GCInspector.java (line 143)
> MESSAGE-DESERIALIZER-POOL         0         0
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,585
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/system/HintsColumnFamily-8-Data.db
>  INFO [GC inspection] 2010-08-19 16:34:03,585 GCInspector.java (line 143)
> CONSISTENCY-MANAGER               0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,586 GCInspector.java (line 143)
> LB-TARGET                         0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,586 GCInspector.java (line 143)
> ROW-MUTATION-STAGE                8        49
>  INFO [GC inspection] 2010-08-19 16:34:03,587 GCInspector.java (line 143)
> MESSAGE-STREAMING-POOL            0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,587 GCInspector.java (line 143)
> LOAD-BALANCER-STAGE               0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,587 GCInspector.java (line 143)
> FLUSH-SORTER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> MEMTABLE-POST-FLUSHER             1         1
>  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> AE-SERVICE-STAGE                  0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> FLUSH-WRITER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:34:03,588 GCInspector.java (line 143)
> HINTED-HANDOFF-POOL               1         6
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,589
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-357-Data.db
>  INFO [GC inspection] 2010-08-19 16:34:03,589 GCInspector.java (line 148)
> CompactionManager               n/a         1
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,600
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-356-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:34:03,603
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Time.Data.Value-359-Data.db
>  INFO [COMPACTION-POOL:1] 2010-08-19 16:34:04,965 CompactionManager.java
> (line 320) Compacted to
> /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1031-Data.db.
> 41190813/36949030 bytes for 9048 keys.  Time: 1399ms.
>  INFO [COMPACTION-POOL:1] 2010-08-19 16:34:04,965 CompactionManager.java
> (line 246) Compacting
> [org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1009-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1017-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1026-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1031-Data.db')]
>  INFO [GC inspection] 2010-08-19 16:34:46,656 GCInspector.java (line 116) GC
> for ConcurrentMarkSweep: 41615 ms, 192522712 reclaimed leaving 8326856720
> used; max is 8700035072
>  INFO [GC inspection] 2010-08-19 16:34:46,657 GCInspector.java (line 137)
> Pool Name                    Active   Pending
>  INFO [GC inspection] 2010-08-19 16:34:46,657 GCInspector.java (line 143)
> STREAM-STAGE                      0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,657 GCInspector.java (line 143)
> FILEUTILS-DELETE-POOL             0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,658 GCInspector.java (line 143)
> RESPONSE-STAGE                    0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,658 GCInspector.java (line 143)
> ROW-READ-STAGE                    0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,658 GCInspector.java (line 143)
> LB-OPERATIONS                     0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,659 GCInspector.java (line 143)
> GMFD                              0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,659 GCInspector.java (line 143)
> MESSAGE-DESERIALIZER-POOL         1         1
>  INFO [GC inspection] 2010-08-19 16:34:46,659 GCInspector.java (line 143)
> CONSISTENCY-MANAGER               0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,660 GCInspector.java (line 143)
> LB-TARGET                         0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,660 GCInspector.java (line 143)
> ROW-MUTATION-STAGE                4         1
>  INFO [GC inspection] 2010-08-19 16:34:46,660 GCInspector.java (line 143)
> MESSAGE-STREAMING-POOL            0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,661 GCInspector.java (line 143)
> LOAD-BALANCER-STAGE               0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,661 GCInspector.java (line 143)
> FLUSH-SORTER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,661 GCInspector.java (line 143)
> MEMTABLE-POST-FLUSHER             0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,662 GCInspector.java (line 143)
> AE-SERVICE-STAGE                  0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,662 GCInspector.java (line 143)
> FLUSH-WRITER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:34:46,663 GCInspector.java (line 143)
> HINTED-HANDOFF-POOL               1         6
>  INFO [GC inspection] 2010-08-19 16:34:46,663 GCInspector.java (line 148)
> CompactionManager               n/a         1
>  INFO [Timer-0] 2010-08-19 16:36:00,785 Gossiper.java (line 180) InetAddress
> /10.4.0.205 is now dead.
>  INFO [Timer-0] 2010-08-19 16:36:00,786 Gossiper.java (line 180) InetAddress
> /10.4.0.204 is now dead.
>  INFO [GC inspection] 2010-08-19 16:36:00,786 GCInspector.java (line 116) GC
> for ConcurrentMarkSweep: 37122 ms, 157488 reclaimed leaving 8342836376 used;
> max is 8700035072
>  INFO [GC inspection] 2010-08-19 16:36:00,789 GCInspector.java (line 137)
> Pool Name                    Active   Pending
>  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> STREAM-STAGE                      0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> FILEUTILS-DELETE-POOL             0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> RESPONSE-STAGE                    0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,790 GCInspector.java (line 143)
> ROW-READ-STAGE                    0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> LB-OPERATIONS                     0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> GMFD                              0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> MESSAGE-DESERIALIZER-POOL         0        82
>  INFO [GC inspection] 2010-08-19 16:36:00,791 GCInspector.java (line 143)
> CONSISTENCY-MANAGER               0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> LB-TARGET                         0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> ROW-MUTATION-STAGE                3       770
>  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> MESSAGE-STREAMING-POOL            0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,792 GCInspector.java (line 143)
> LOAD-BALANCER-STAGE               0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,793 GCInspector.java (line 143)
> FLUSH-SORTER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,793 GCInspector.java (line 143)
> MEMTABLE-POST-FLUSHER             0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,793 GCInspector.java (line 143)
> AE-SERVICE-STAGE                  0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,796
> MessageDeserializationTask.java (line 47) dropping message (44,076ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,797
> MessageDeserializationTask.java (line 47) dropping message (44,077ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,797
> MessageDeserializationTask.java (line 47) dropping message (44,077ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 16:36:00,798
> MessageDeserializationTask.java (line 47) dropping message (44,078ms past
> timeout)
>  INFO [Timer-0] 2010-08-19 16:36:00,797 Gossiper.java (line 180) InetAddress
> /10.4.0.203 is now dead.
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,797
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1030-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,805
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1028-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,808
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1027-Data.db
>  INFO [SSTABLE-CLEANUP-TIMER] 2010-08-19 16:36:00,812
> SSTableDeletingReference.java (line 104) Deleted
> /mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1029-Data.db
>  INFO [GC inspection] 2010-08-19 16:36:00,833 GCInspector.java (line 143)
> FLUSH-WRITER-POOL                 0         0
>  INFO [GC inspection] 2010-08-19 16:36:00,834 GCInspector.java (line 143)
> HINTED-HANDOFF-POOL               1         6
>  INFO [GMFD:1] 2010-08-19 16:36:00,834 Gossiper.java (line 578) InetAddress
> /10.4.0.204 is now UP
>  INFO [GC inspection] 2010-08-19 16:36:00,834 GCInspector.java (line 148)
> CompactionManager               n/a         1
>  INFO [GMFD:1] 2010-08-19 16:36:00,839 Gossiper.java (line 578) InetAddress
> /10.4.0.203 is now UP
>  INFO [GMFD:1] 2010-08-19 16:36:00,839 Gossiper.java (line 578) InetAddress
> /10.4.0.205 is now UP
>  INFO [ROW-MUTATION-STAGE:34] 2010-08-19 16:36:00,898 ColumnFamilyStore.java
> (line 357) Point.Time.Data.Value has reached its threshold; switching in a
> fresh Memtable at
> CommitLogContext(file='/var/lib/cassandra/commitlog/CommitLog-1282235556898.log',
> position=41642302)
>  INFO [ROW-MUTATION-STAGE:34] 2010-08-19 16:36:00,898 ColumnFamilyStore.java
> (line 609) Enqueuing flush of
> memtable-point.time.data.va...@543707511(4032380 bytes, 374579 operations)
>  INFO [FLUSH-WRITER-POOL:1] 2010-08-19 16:36:00,898 Memtable.java (line 148)
> Writing memtable-point.time.data.va...@543707511(4032380 bytes, 374579
> operations)
> ERROR [Thread-121] 2010-08-19 16:36:00,911 CassandraDaemon.java (line 83)
> Uncaught exception in thread Thread[Thread-121,5,main]
> java.lang.OutOfMemoryError: Java heap space
>     at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
>
>
>
> Messages prior to last crash.
>
>
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:06:32,245
> MessageDeserializationTask.java (line 47) dropping message (525,696ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,391
> MessageDeserializationTask.java (line 47) dropping message (562,843ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:07:09,391 GCInspector.java (line 143)
> LB-OPERATIONS                     0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,392
> MessageDeserializationTask.java (line 47) dropping message (562,843ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:07:09,392 GCInspector.java (line 143)
> GMFD                              0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,392
> MessageDeserializationTask.java (line 47) dropping message (562,844ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:09,394
> MessageDeserializationTask.java (line 47) dropping message (562,846ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:42,315
> MessageDeserializationTask.java (line 47) dropping message (595,767ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:42,316
> MessageDeserializationTask.java (line 47) dropping message (595,768ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:07:42,316
> MessageDeserializationTask.java (line 47) dropping message (595,768ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:07:41,965 GCInspector.java (line 143)
> MESSAGE-DESERIALIZER-POOL         1    567231
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:08:18,189
> MessageDeserializationTask.java (line 47) dropping message (631,641ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:08:18,192 GCInspector.java (line 143)
> CONSISTENCY-MANAGER               0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:08:18,193
> MessageDeserializationTask.java (line 47) dropping message (631,645ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:09:18,355
> MessageDeserializationTask.java (line 47) dropping message (691,807ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:09:18,356 GCInspector.java (line 143)
> LB-TARGET                         0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:09:44,730
> MessageDeserializationTask.java (line 47) dropping message (691,808ms past
> timeout)
> ERROR [COMPACTION-POOL:1] 2010-08-19 18:18:42,658 CassandraDaemon.java (line
> 83) Uncaught exception in thread Thread[COMPACTION-POOL:1,1,main]
> java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java
> heap space
>     at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>     at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>     at
> org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86)
>     at
> org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:577)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at
> org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:84)
>     at
> org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:335)
>     at
> org.apache.cassandra.db.SuperColumnSerializer.deserialize(SuperColumn.java:284)
>     at
> org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:117)
>     at
> org.apache.cassandra.db.ColumnFamilySerializer.deserializeFromSSTable(ColumnFamilySerializer.java:160)
>     at
> org.apache.cassandra.io.IteratingRow.getColumnFamily(IteratingRow.java:79)
>     at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:112)
>     at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:41)
>     at
> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>     at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>     at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>     at
> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>     at
> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>     at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:295)
>     at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102)
>     at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:83)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     ... 2 more
> ERROR [Thread-13] 2010-08-19 18:18:42,658 CassandraDaemon.java (line 83)
> Uncaught exception in thread Thread[Thread-13,5,main]
> java.lang.OutOfMemoryError: Java heap space
>     at java.util.concurrent.FutureTask.<init>(FutureTask.java:64)
>     at
> java.util.concurrent.AbstractExecutorService.newTaskFor(AbstractExecutorService.java:58)
>     at
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:77)
>     at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:73)
>  INFO [GC inspection] 2010-08-19 18:18:42,658 GCInspector.java (line 143)
> ROW-MUTATION-STAGE                3         3
>  INFO [GC inspection] 2010-08-19 18:18:42,666 GCInspector.java (line 143)
> MESSAGE-STREAMING-POOL            0         0
>  INFO [COMPACTION-POOL:1] 2010-08-19 18:18:42,659 CompactionManager.java
> (line 246) Compacting
> [org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1071-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1072-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1073-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1074-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1075-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1076-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1077-Data.db'),org.apache.cassandra.io.SSTableReader(path='/mainraid/cassandra/data/Instance.4/Point.Data.Time.Value-1078-Data.db')]
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,658
> MessageDeserializationTask.java (line 47) dropping message (1,256,110ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:18:42,666 GCInspector.java (line 143)
> LOAD-BALANCER-STAGE               0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,667
> MessageDeserializationTask.java (line 47) dropping message (1,256,119ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> MessageDeserializationTask.java (line 47) dropping message (1,256,412ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> MessageDeserializationTask.java (line 47) dropping message (1,256,412ms past
> timeout)
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> MessageDeserializationTask.java (line 47) dropping message (1,256,412ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:18:42,960 GCInspector.java (line 143)
> FLUSH-SORTER-POOL                 0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,960
> MessageDeserializationTask.java (line 47) dropping message (1,256,412ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:18:42,962 GCInspector.java (line 143)
> MEMTABLE-POST-FLUSHER             0         0
>  WARN [MESSAGE-DESERIALIZER-POOL:1] 2010-08-19 18:18:42,963
> MessageDeserializationTask.java (line 47) dropping message (1,256,415ms past
> timeout)
>  INFO [GC inspection] 2010-08-19 18:18:42,965 GCInspector.java (line 143)
> AE-SERVICE-STAGE                  0         0
>  INFO [GC inspection] 2010-08-19 18:18:42,968 GCInspector.java (line 143)
> FLUSH-WRITER-POOL                 0         0
>
>
>
>

First,
I have some general questions. Are you using random partitioner? Are
your tokens calculated properly to spread data around the ring or did
cassandra auto-choose tokens? What to your system vitals look like?
What is the disk statistics are the disks saturated? Is the system in
IO wait? top, iostat,sar. What does CFSTATS say? What is your
largestRow? What is your average row? What is your cache hit rate?
What is your replication factor?  Exactly how many rows are you
inserting a second? If you throttle your load while tuning up your
settings that is more scientific then firing full blast bulk loads
that causes crashes.

I notice you said this... 6 x Nodes with 8 cores, 8G heap, 24g physical.
I would go higher with the heap 8 cores, 8G heap, 24g physical. If you
are hitting OOM with 8G go higher. If you are going to use row cache
you should probably bring this really high.

You are getting OOM. I notice you tried rowcaching and keycaching. How
did you set this to fixed size or % ? You should almost always set it
to fixed size.

Here is something profound that I learned:
Larger row cache can SAVE you memory.
What huh? How can that be? more memory uses less memory?
Having a small row/key cache caused higher disk utilization. If all
requests are going to disk request take longer using more memory for
longer. BUT if you get that request served off Row Cache. Pow Zip Bang
the request is in and out fast, and you never get into using those
longer lived java memory pools.
Then again if your workload is ALL write, caches will not help you with that.

Set this low:
 <RowWarningThresholdInMB>10</RowWarningThresholdInMB>
 I found that rows with millions of columns would cause havoc for me.
This helped me find some bad rows.

How much data is on each node 'nodetool ring' output

Reply via email to