[ https://issues.apache.org/jira/browse/CASSANDRA-9021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14378030#comment-14378030 ]

Benedict commented on CASSANDRA-9021:
-------------------------------------

As it happens, test-compression was already exhibiting the bug; we just weren't 
monitoring it closely, since not all of the compression tests pass and it only 
runs under "test-all", which we still seem to ignore. I would prefer to stick 
to this as the route for tracking the bug, since it feels a bit too specific to 
warrant its own unit test: it's an upshot of the code history that preceded it, 
which a test cannot meaningfully recreate, and the orderings we'd be testing 
for are specific to the current status quo. Testing for a historical accident 
rather than a general functional failure seems likely to induce false 
confidence, whereas the tests that already exhibit the bug are general 
functional tests that should catch any such misordering, not just this one. It 
also wouldn't be easy to reproduce in a unit test as things stand, since 
triggering the problem depended on reallocation of the freed memory to corrupt 
the state. We could instead add a burn test for sstable compaction, ensuring 
the sstables behave correctly at all times for its duration. I would be in 
favour of this, as it would put all of these paths more thoroughly through 
their paces, but I think that warrants a separate endeavour.
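
For illustration, here is a rough sketch of the shape such a burn test might 
take. It is a self-contained toy (none of the classes below are real Cassandra 
test APIs): the point it demonstrates is asserting the invariant on every 
iteration, for the whole duration of the run, rather than once at the end.

{code}
// Toy sketch of a burn-test loop: an in-memory stand-in for
// write -> flush -> compact, with the invariant checked on every
// iteration rather than once at the end. Nothing here is a real
// Cassandra API; run with -ea so the assert fires.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public class CompactionBurnSketch
{
    public static void main(String[] args)
    {
        List<List<Integer>> sstables = new ArrayList<>();
        int rowsWritten = 0;
        long deadline = System.nanoTime() + 5_000_000_000L; // ~5s burn
        while (System.nanoTime() < deadline)
        {
            // "flush" a batch of new rows as a fresh sstable
            List<Integer> flushed = new ArrayList<>();
            for (int n = ThreadLocalRandom.current().nextInt(1, 100); n > 0; n--)
                flushed.add(rowsWritten++);
            sstables.add(flushed);

            // "compact" all sstables into one
            List<Integer> compacted = new ArrayList<>();
            for (List<Integer> t : sstables)
                compacted.addAll(t);
            sstables.clear();
            sstables.add(compacted);

            // the point of a burn test: check correctness at all times
            assert compacted.size() == rowsWritten : "rows lost during compaction";
        }
    }
}
{code}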

There was an interesting question as to why reproducing the bug relied on 
memory corruption rather than on a more aggressive assertion that the memory 
had been freed. I've tracked this down to SafeMemory not actually zeroing out 
its peer, since the free is performed by its internal Tidy. However, there is 
no reason not to zero the peer immediately when a SafeMemory is closed, since 
any access of a closed SafeMemory is bad, whether or not the shared memory 
allocation is still live. So I'm attaching another micro patch that changes 
this behaviour and prints much more helpful error messages when running 
test-compression.
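
To make the change concrete, here is a minimal sketch of the idea (simplified 
stand-ins only, not the actual SafeMemory/Ref/Tidy code): the shared allocation 
is still freed only when the last reference is released, but each reference 
zeroes its own peer the moment it is closed, so any later access through a 
closed reference trips an assertion immediately instead of silently reading 
memory that may have been reallocated.

{code}
// Minimal sketch, NOT the real org.apache.cassandra.io.util.SafeMemory:
// the shared allocation is freed by a ref-counted "tidy" when the last
// reference is released, but each reference zeroes its own peer as soon
// as it is closed, so later accesses fail fast.
import java.util.concurrent.atomic.AtomicInteger;

final class SafeMemorySketch implements AutoCloseable
{
    // stand-in for the shared Ref/Tidy state
    static final class SharedTidy
    {
        final AtomicInteger refs = new AtomicInteger(1);
        volatile long address; // would come from Unsafe.allocateMemory(size)

        SharedTidy(long address) { this.address = address; }

        void release()
        {
            if (refs.decrementAndGet() == 0)
                address = 0; // unsafe.freeMemory(address) would go here
        }
    }

    private final SharedTidy tidy;
    private long peer; // this reference's view of the native address

    SafeMemorySketch(SharedTidy tidy)
    {
        this.tidy = tidy;
        this.peer = tidy.address;
    }

    long getLong(long offset)
    {
        // fail fast: a closed SafeMemory has peer == 0, whether or not
        // the shared allocation is still live
        assert peer != 0 : "Access to closed SafeMemory";
        return 0L; // would be unsafe.getLong(peer + offset)
    }

    @Override
    public void close()
    {
        peer = 0;       // the micro patch: zero immediately on close...
        tidy.release(); // ...while the shared free remains ref-counted
    }
}
{code}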

> AssertionError and Leak detected during sstable compaction
> ----------------------------------------------------------
>
>                 Key: CASSANDRA-9021
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9021
>             Project: Cassandra
>          Issue Type: Bug
>         Environment: Cluster setup:
> - 20-node stress cluster, GCE n1-standard-2
> - 10-node receiver cluster ingesting data, GCE n1-standard-8 
> Platform:
> - Ubuntu 12.04 x86_64
> Versions:
> - DSE 4.7.0
> - Cassandra 2.1.3.304
> - Java 1.7.0_45
> DSE Configuration:
> -Xms7540M
> -Xmx7540M
> -Xmn800M
> -Ddse.system_cpu_cores=8 -Ddse.system_memory_in_mb=30161
> -Dcassandra.config.loader=com.datastax.bdp.config.DseConfigurationLoader
> -ea -javaagent:/usr/local/lib/dse/resources/cassandra/lib/jamm-0.3.0.jar
> -XX:+CMSClassUnloadingEnabled -XX:+UseThreadPriorities
> -XX:ThreadPriorityPolicy=42 -Xms7540M -Xmx7540M -Xmn800M
> -XX:+HeapDumpOnOutOfMemoryError -Xss256k
> -XX:StringTableSize=1000003 -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB
>            Reporter: Rocco Varela
>            Assignee: Benedict
>             Fix For: 2.1.4
>
>         Attachments: 9021-improvereporting.txt, 9021.txt, system.log
>
>
> After ~3 hours of data ingestion we see assertion errors and 'LEAK DETECTED' 
> errors during what looks like sstable compaction.
> system.log snippets (full log attached):
> {code}
> ...
> INFO  [CompactionExecutor:12] 2015-03-23 02:45:51,770  CompactionTask.java:267 - Compacted 4 sstables to [/mnt/cass_data_disks/data1/requests_ks/timeline-9500fe40d0f611e495675d5ea01541b5/requests_ks-timeline-ka-185,].  65,916,594 bytes to 66,159,512 (~100% of original) in 26,554ms = 2.376087MB/s.  983 total partitions merged to 805.  Partition merge counts were {1:627, 2:178, }
> INFO  [CompactionExecutor:11] 2015-03-23 02:45:51,837  CompactionTask.java:267 - Compacted 4 sstables to [/mnt/cass_data_disks/data1/system/compactions_in_progress-55080ab05d9c388690a4acb25fe1f77b/system-compactions_in_progress-ka-119,].  426 bytes to 42 (~9% of original) in 82ms = 0.000488MB/s.  5 total partitions merged to 1.  Partition merge counts were {1:1, 2:2, }
> ERROR [NonPeriodicTasks:1] 2015-03-23 02:45:52,251  CassandraDaemon.java:167 - Exception in thread Thread[NonPeriodicTasks:1,5,main]
> java.lang.AssertionError: null
>  at org.apache.cassandra.io.compress.CompressionMetadata$Chunk.<init>(CompressionMetadata.java:438) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.compress.CompressionMetadata.chunkFor(CompressionMetadata.java:228) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.util.CompressedPoolingSegmentedFile.dropPageCache(CompressedPoolingSegmentedFile.java:80) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.sstable.SSTableReader$6.run(SSTableReader.java:923) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at org.apache.cassandra.io.sstable.SSTableReader$InstanceTidier$1.run(SSTableReader.java:2036) ~[cassandra-all-2.1.3.304.jar:2.1.3.304]
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) ~[na:1.7.0_45]
>  at java.util.concurrent.FutureTask.run(FutureTask.java:262) ~[na:1.7.0_45]
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178) ~[na:1.7.0_45]
>  at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292) ~[na:1.7.0_45]
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) ~[na:1.7.0_45]
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_45]
>  at java.lang.Thread.run(Thread.java:744) [na:1.7.0_45]
> ...
> INFO  [MemtableFlushWriter:50] 2015-03-23 02:47:29,465  Memtable.java:378 - Completed flushing /mnt/cass_data_disks/data1/requests_ks/timeline-9500fe40d0f611e495675d5ea01541b5/requests_ks-timeline-ka-188-Data.db (16311981 bytes) for commitlog position ReplayPosition(segmentId=1427071574495, position=4523631)
> ERROR [Reference-Reaper:1] 2015-03-23 02:47:33,987  Ref.java:181 - LEAK DETECTED: a reference (org.apache.cassandra.utils.concurrent.Ref$State@2f59b10) to class org.apache.cassandra.io.sstable.SSTableReader$DescriptorTypeTidy@1251424500:/mnt/cass_data_disks/data1/requests_ks/timeline-9500fe40d0f611e495675d5ea01541b5/requests_ks-timeline-ka-149 was not released before the reference was garbage collected
> INFO  [Service Thread] 2015-03-23 02:47:40,158  GCInspector.java:142 - ConcurrentMarkSweep GC in 12247ms.  CMS Old Gen: 5318987136 -> 457655168; CMS Perm Gen: 44731264 -> 44699160; Par Eden Space: 8597912 -> 418006664; Par Survivor Space: 71865728 -> 59679584
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
