Re: Public Interface Failure in Multiple DC setup

2016-08-11 Thread Anuj Wadehra
Hi 
Can someone take these questions?
Thanks,
Anuj

 
 
 On Thu, 11 Aug, 2016 at 8:30 PM, Anuj Wadehra wrote:  
Hi,

Setup: Cassandra 2.0.14 with PropertyFileSnitch. 2 Data Centers. 
Every node has broadcast address= Public IP (bond0) & listen address=Private IP 
(bond1).

As per DataStax docs,
(https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html),

"For intra-network or region traffic, Cassandra switches to the private IP 
after establishing a connection". 

This means that even for traffic within a DC, Cassandra first contacts a node 
on its broadcast address (public IP) and then switches to the private IP.
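As a rough illustration of that switch, here is a toy sketch (hypothetical code, not Cassandra's actual implementation; the node names and addresses are invented for the example): same-DC peers end up addressed on the private IP, while remote-DC peers stay on the public IP.

```python
# Toy sketch of multi-network endpoint selection: initial contact uses the
# broadcast (public) address; once a peer's DC is known, same-DC traffic
# switches to the private IP. Illustrative only, not Cassandra's code.
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    dc: str
    public_ip: str   # broadcast_address (bond0)
    private_ip: str  # listen_address (bond1)

def preferred_endpoint(local_dc: str, peer: Peer) -> str:
    """Same-DC peers are reached on the private IP, remote DCs on the public IP."""
    return peer.private_ip if peer.dc == local_dc else peer.public_ip

peers = [
    Peer("n1", "DC1", "203.0.113.10", "10.0.0.10"),
    Peer("n2", "DC2", "203.0.113.20", "10.0.1.20"),
]
print([preferred_endpoint("DC1", p) for p in peers])
# -> ['10.0.0.10', '203.0.113.20']
```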

Query:

If we shut down bond0 (public interface) on a node:

1. Will read/write requests from coordinators in the local DC be routed to the 
node via its listen address (private IP, bond1), or will the node be treated as 
DOWN?

2. Will gossip be able to discover the node? If the node uses its private 
interface (bond1) to send gossip messages to other nodes on their 
public/broadcast addresses, will other nodes in the local and remote DCs see 
the node (with bond0 down) as UP?

I am aware that https://issues.apache.org/jira/browse/CASSANDRA-9748 is an open 
issue in 2.0.14. 
But even for later releases, I am interested in the behavior when the public 
interface is down and PropertyFileSnitch is used.

Thanks
Anuj
  


Corrupt SSTABLE over and over

2016-08-11 Thread Alaa Zubaidi (PDF)
Hi,

I have a 16-node cluster, Cassandra 2.2.1 on Windows, installed locally
(NOT in the cloud),

and I am getting:
ERROR [CompactionExecutor:2] 2016-08-12 06:51:52,983
CassandraDaemon.java:183 - Exception in thread Thread[CompactionExecutor:2,1,main]
org.apache.cassandra.io.FSReadError:
org.apache.cassandra.io.sstable.CorruptSSTableException:
org.apache.cassandra.io.compress.CorruptBlockException:
(E:\\la-4886-big-Data.db): corruption detected, chunk at 4969092 of
length 10208.
at
org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:357)
~[apache-cassandra-2.2.1.jar:2.2.1]


ERROR [CompactionExecutor:2] ... FileUtils.java:463 - Exiting
forcefully due to file system exception on startup, disk failure policy
"stop"

I tried sstablescrub, but it crashed with an hs-err-pid-... JVM crash log.
I removed the corrupted file and started the node again; after one day the
corruption came back. I removed the files and restarted Cassandra, and it
worked for a few days. Then I ran "nodetool repair"; after it finished,
Cassandra failed again, this time with commitlog corruption. After removing
the commitlog files, it failed again with another sstable corruption.
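For background, Cassandra stores a checksum with each compressed sstable chunk and raises CorruptBlockException when a chunk fails validation on read, which is why corruption that keeps reappearing on freshly rewritten files often points at hardware (disk, RAM, or controller). A toy sketch of the per-chunk check (illustrative only, not Cassandra's code):

```python
# Toy illustration of per-chunk checksum validation, the kind of check behind
# "corruption detected, chunk at ... of length ...". Each chunk is written
# with a CRC; a mismatch on read means the bytes changed on disk.
import zlib

def write_chunk(data: bytes) -> bytes:
    crc = zlib.crc32(data)
    return data + crc.to_bytes(4, "big")

def read_chunk(stored: bytes) -> bytes:
    data, crc = stored[:-4], int.from_bytes(stored[-4:], "big")
    if zlib.crc32(data) != crc:
        raise IOError("corruption detected in chunk")
    return data

chunk = write_chunk(b"sstable chunk payload")
assert read_chunk(chunk) == b"sstable chunk payload"

flipped = bytes([chunk[0] ^ 0xFF]) + chunk[1:]  # simulate a bad disk/RAM bit
try:
    read_chunk(flipped)
except IOError as e:
    print(e)  # -> corruption detected in chunk
```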

I was also checking the HW, file system, and memory, the VMware logs showed
no HW error, also the HW management logs showed NO problems or issues.
Also checked the Windows Logs (Application and System) the only thing I
found is on the system logs "Cassandra Service terminated with
service-specific error Cannot create another system semaphore.

I could not find any thing regarding that error, all comments point to
application log.

Any help is appreciated..

-- 

Alaa Zubaidi



Re: nodetool repair with -pr and -dc

2016-08-11 Thread kurt Greaves
-D does not do what you think it does. I've quoted the relevant
documentation from the README:

>
> Multiple
> Datacenters
>
> If you have multiple datacenters in your ring, then you MUST specify the
> name of the datacenter containing the node you are repairing as part of the
> command-line options (--datacenter=DCNAME). Failure to do so will result in
> only a subset of your data being repaired (approximately
> data/number-of-datacenters). This is because nodetool has no way to
> determine the relevant DC on its own, which in turn means it will use the
> tokens from every ring member in every datacenter.
>


On 11 August 2016 at 12:24, Paulo Motta  wrote:

> > if we want to use -pr option ( which i suppose we should to prevent
> duplicate checks) in 2.0 then if we run the repair on all nodes in a single
> DC then it should be sufficient and we should not need to run it on all
> nodes across DC's?
>
> No, because the primary ranges of the nodes in other DCs will be missing
> repair, so you should either run with -pr in all nodes in all DCs, or
> restrict repair to a specific DC with -local (and have duplicate checks).
> Combined -pr and -local are only supported on 2.1
>
>
> 2016-08-11 1:29 GMT-03:00 Anishek Agarwal :
>
>> ok thanks, so if we want to use -pr option ( which i suppose we should to
>> prevent duplicate checks) in 2.0 then if we run the repair on all nodes in
>> a single DC then it should be sufficient and we should not need to run it
>> on all nodes across DC's ?
>>
>>
>>
>> On Wed, Aug 10, 2016 at 5:01 PM, Paulo Motta 
>> wrote:
>>
>>> On 2.0 the repair -pr option is not supported together with -local, -hosts
>>> or -dc, since it assumes you need to repair all nodes in all DCs and it
>>> will throw an error if you try to run it with nodetool, so perhaps there's
>>> something wrong with range_repair's option parsing.
>>>
>>> On 2.1, support for simultaneous -pr and -local options was added in
>>> CASSANDRA-7450, so if you need that you can either upgrade to 2.1 or
>>> backport that patch to 2.0.
>>>
>>>
>>> 2016-08-10 5:20 GMT-03:00 Anishek Agarwal :
>>>
 Hello,

 We have a 2.0.17 Cassandra cluster (*DC1*) with a cross-DC setup with a
 smaller cluster (*DC2*). After reading various blogs about
 scheduling/running repairs, it looks like it's good to run them with the following:


 -pr for primary range only
 -st -et for sub ranges
 -par for parallel
 -dc to make sure we can schedule repairs independently on each Data
 centre we have.

 i have configured the above using the repair utility @
 https://github.com/BrianGallew/cassandra_range_repair.git

 which leads to the following command :

 ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H
 localhost -p -D *DC1*

 but it looks like the merkle tree is being calculated on nodes which are
 part of the other DC, *DC2*.

 Why does this happen? I thought it should only look at the nodes in the
 local cluster. However, with nodetool the *-pr* option cannot be used
 with *-local* according to the docs at
 https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html

 So I may be missing something; can someone help explain this, please?

 thanks
 anishek

>>>
>>>
>>
>


-- 
Kurt Greaves
k...@instaclustr.com
www.instaclustr.com


Re: migrating from 2.1.2 to 3.0.8 log errors

2016-08-11 Thread Adil
After migrating C* from 2.1.2 to 3.0.8, all queries with a WHERE condition
involving an indexed column return zero rows for the old data, whereas newly
inserted data is returned by the same query. I'm guessing that something
about the indexes remained incomplete. Should we run a rebuild of the
indexes? Any ideas?
Thanks,
Adil

On 10 Aug 2016 at 23:58, "Adil"  wrote:

> Thank you for your response. We have updated the DataStax driver to 3.1.0
> using the V3 protocol; I think there are still some webapps using the
> 2.1.6 Java driver, and we will upgrade them. But we noticed strange things:
> on webapps upgraded to 3.1.0, some queries return zero results even if the
> data exists. I can see it with cqlsh.
>
> 2016-08-10 20:48 GMT+02:00 Tyler Hobbs :
>
>> That just means that a client/driver disconnected.  Those log messages
>> are supposed to be suppressed, but perhaps that stopped working in 3.x due
>> to another change.
>>
>> On Wed, Aug 10, 2016 at 10:33 AM, Adil  wrote:
>>
>>> Hi guys,
>>> We have migrated our cluster (5 nodes in DC1 and 5 nodes in DC2) from
>>> cassandra 2.1.2 to 3.0.8, all seems fine, executing nodetool status shows
>>> all nodes UN, but in each node's log there is this log error continuously:
>>> java.io.IOException: Error while read(...): Connection reset by peer
>>> at io.netty.channel.epoll.Native.readAddress(Native Method)
>>> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
>>> at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.
>>> doReadBytes(EpollSocketChannel.java:675) ~[netty-all-4.0.23.Final.jar:4
>>> .0.23.Final]
>>> at io.netty.channel.epoll.EpollSocketChannel$EpollSocketUnsafe.
>>> epollInReady(EpollSocketChannel.java:714) ~[netty-all-4.0.23.Final.jar:4
>>> .0.23.Final]
>>> at 
>>> io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:326)
>>> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
>>> at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:264)
>>> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
>>> at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sin
>>> gleThreadEventExecutor.java:116) ~[netty-all-4.0.23.Final.jar:4
>>> .0.23.Final]
>>> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnabl
>>> eDecorator.run(DefaultThreadFactory.java:137)
>>> ~[netty-all-4.0.23.Final.jar:4.0.23.Final]
>>> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_101]
>>>
>>> we have installed java-8_101
>>>
>>> any idea what would be the problem?
>>>
>>> thanks
>>>
>>> Adil
>>>
>>>
>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>
>


Re: Question nodetool status

2016-08-11 Thread jean paul
Hi, thanks a lot for the answer :)

Gossip is a peer-to-peer communication protocol in which nodes periodically
exchange state information about themselves and about other nodes they know
about.

unreachableNodes = probe.getUnreachableNodes();   --->  i.e., if a node doesn't
publish heartbeats for x seconds (via the gossip protocol), it is therefore
marked 'DN: down'?


Is that right?
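Roughly, yes, with the caveat that Cassandra's gossiper uses a phi-accrual failure detector (an adaptive suspicion level) rather than a fixed x-second timeout. A minimal fixed-timeout sketch of the idea (hypothetical code, not Cassandra's implementation):

```python
# Toy heartbeat-based failure detector: a node with no recent gossip
# heartbeat is reported DN. Cassandra actually uses phi-accrual, which
# adapts to observed heartbeat intervals instead of a fixed cutoff.
class FailureDetector:
    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}  # node -> time of last gossip heartbeat

    def report(self, node: str, now: float) -> None:
        self.last_heartbeat[node] = now

    def status(self, node: str, now: float) -> str:
        last = self.last_heartbeat.get(node)
        if last is None or now - last > self.timeout_s:
            return "DN"  # down
        return "UN"      # up

fd = FailureDetector(timeout_s=10.0)
fd.report("127.0.0.1", now=0.0)
print(fd.status("127.0.0.1", now=5.0))   # -> UN
print(fd.status("127.0.0.1", now=20.0))  # -> DN
```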





2016-08-11 13:51 GMT+01:00 Romain Hardouin :

> Hi Jean Paul,
>
> Yes, the gossiper is used. Example with down nodes:
> 1. The status command retrieves unreachable nodes from a NodeProbe
> instance:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/nodetool/Status.java#L64
> 2. The NodeProbe list comes from a StorageService proxy:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeProbe.java#L438
> 3. The proxy calls the Gossiper singleton:
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L2681
>
> Best,
>
> Romain
>
> On Thursday, 11 August 2016 at 14:16, jean paul  wrote:
>
>
> Hi all,
>
>
>
> $ nodetool status
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
> UN  127.0.0.1  83.05 KB   256     100.0%            460ddcd9-1ee8-48b8-a618-c076056aad07  rack1
>
> The nodetool command shows the status of the node (UN=up, DN=down):
>
> Please, I'd like to know how this command works: is it based on the gossip
> protocol or not?
>
> Thank you so much for explanations.
> Best regards.
>
>
>
>
>


Public Interface Failure in Multiple DC setup

2016-08-11 Thread Anuj Wadehra

Hi,

Setup: Cassandra 2.0.14 with PropertyFileSnitch. 2 Data Centers. 
Every node has broadcast address= Public IP (bond0) & listen address=Private IP 
(bond1).

As per DataStax docs,
(https://docs.datastax.com/en/cassandra/2.0/cassandra/configuration/configMultiNetworks.html),

"For intra-network or region traffic, Cassandra switches to the private IP 
after establishing a connection". 

This means that even for traffic within a DC, Cassandra first contacts a node 
on its broadcast address (public IP) and then switches to the private IP.

Query:

If we shut down bond0 (public interface) on a node:

1. Will read/write requests from coordinators in the local DC be routed to the 
node via its listen address (private IP, bond1), or will the node be treated as 
DOWN?

2. Will gossip be able to discover the node? If the node uses its private 
interface (bond1) to send gossip messages to other nodes on their 
public/broadcast addresses, will other nodes in the local and remote DCs see 
the node (with bond0 down) as UP?

I am aware that https://issues.apache.org/jira/browse/CASSANDRA-9748 is an open 
issue in 2.0.14. 
But even for later releases, I am interested in the behavior when the public 
interface is down and PropertyFileSnitch is used.

Thanks
Anuj


Re: Question nodetool status

2016-08-11 Thread Romain Hardouin
Hi Jean Paul,
Yes, the gossiper is used. Example with down nodes:
1. The status command retrieves unreachable nodes from a NodeProbe instance:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/nodetool/Status.java#L64
2. The NodeProbe list comes from a StorageService proxy:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/tools/NodeProbe.java#L438
3. The proxy calls the Gossiper singleton:
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L2681
 
Best,
Romain

On Thursday, 11 August 2016 at 14:16, jean paul  wrote:
 

 Hi all,

$ nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  83.05 KB   256     100.0%            460ddcd9-1ee8-48b8-a618-c076056aad07  rack1

The nodetool command shows the status of the node (UN=up, DN=down).
Please, I'd like to know how this command works: is it based on the gossip
protocol or not?

Thank you so much for explanations.
Best regards.




  

Re: nodetool repair with -pr and -dc

2016-08-11 Thread Paulo Motta
> if we want to use -pr option ( which i suppose we should to prevent
duplicate checks) in 2.0 then if we run the repair on all nodes in a single
DC then it should be sufficient and we should not need to run it on all
nodes across DC's?

No, because the primary ranges of the nodes in other DCs will be missing
repair, so you should either run with -pr in all nodes in all DCs, or
restrict repair to a specific DC with -local (and have duplicate checks).
Combined -pr and -local are only supported on 2.1
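To see why, here is a sketch on a toy four-node, two-DC ring (hypothetical tokens and node names, single token per node): with -pr each node repairs only the range it is the primary replica for, so running it on DC1's nodes alone leaves DC2's nodes' primary ranges unrepaired.

```python
# Toy token ring illustrating "repair -pr" coverage across DCs.
# Each node's primary range is (previous token, its token], wrapping.
nodes = [  # (token, node, dc), in ring order by token
    (0,  "dc1-a", "DC1"),
    (25, "dc2-a", "DC2"),
    (50, "dc1-b", "DC1"),
    (75, "dc2-b", "DC2"),
]

def primary_ranges(ring):
    out = {}
    for i, (tok, name, dc) in enumerate(ring):
        prev_tok = ring[i - 1][0]  # wraps around for i == 0
        out[name] = (prev_tok, tok, dc)
    return out

ranges = primary_ranges(nodes)
dc1_only = [(lo, hi) for lo, hi, dc in ranges.values() if dc == "DC1"]
print(f"-pr on DC1 nodes repairs {len(dc1_only)} of {len(ranges)} primary ranges")
# -> -pr on DC1 nodes repairs 2 of 4 primary ranges
```

DC2's primary ranges (0, 25] and (50, 75] are never touched, which matches the "approximately data/number-of-datacenters" warning in the range_repair README.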


2016-08-11 1:29 GMT-03:00 Anishek Agarwal :

> ok thanks, so if we want to use -pr option ( which i suppose we should to
> prevent duplicate checks) in 2.0 then if we run the repair on all nodes in
> a single DC then it should be sufficient and we should not need to run it
> on all nodes across DC's ?
>
>
>
> On Wed, Aug 10, 2016 at 5:01 PM, Paulo Motta 
> wrote:
>
>> On 2.0 the repair -pr option is not supported together with -local, -hosts or
>> -dc, since it assumes you need to repair all nodes in all DCs and it will
>> throw an error if you try to run it with nodetool, so perhaps there's
>> something wrong with range_repair's option parsing.
>>
>> On 2.1, support for simultaneous -pr and -local options was added in
>> CASSANDRA-7450, so if you need that you can either upgrade to 2.1 or
>> backport that patch to 2.0.
>>
>>
>> 2016-08-10 5:20 GMT-03:00 Anishek Agarwal :
>>
>>> Hello,
>>>
>>> We have a 2.0.17 Cassandra cluster (*DC1*) with a cross-DC setup with a
>>> smaller cluster (*DC2*). After reading various blogs about
>>> scheduling/running repairs, it looks like it's good to run them with the following:
>>>
>>>
>>> -pr for primary range only
>>> -st -et for sub ranges
>>> -par for parallel
>>> -dc to make sure we can schedule repairs independently on each Data
>>> centre we have.
>>>
>>> i have configured the above using the repair utility @
>>> https://github.com/BrianGallew/cassandra_range_repair.git
>>>
>>> which leads to the following command :
>>>
>>> ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H
>>> localhost -p -D *DC1*
>>>
>>> but it looks like the merkle tree is being calculated on nodes which are
>>> part of the other DC, *DC2*.
>>>
>>> Why does this happen? I thought it should only look at the nodes in the
>>> local cluster. However, with nodetool the *-pr* option cannot be used with
>>> *-local* according to the docs at
>>> https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html
>>>
>>> So I may be missing something; can someone help explain this, please?
>>>
>>> thanks
>>> anishek
>>>
>>
>>
>


Question nodetool status

2016-08-11 Thread jean paul
Hi all,



$ nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  83.05 KB   256     100.0%            460ddcd9-1ee8-48b8-a618-c076056aad07  rack1

The nodetool command shows the status of the node (UN=up, DN=down).


Please, I'd like to know how this command works: is it based on the gossip
protocol or not?


Thank you so much for explanations.

Best regards.


Re: JVM Crash on 3.0.6

2016-08-11 Thread Stefano Ortolani
Not directly related, but note that on 12.04 I had to disable jemalloc,
otherwise nodes would randomly die at startup
(https://issues.apache.org/jira/browse/CASSANDRA-11723).

Regards,
Stefano

On Thu, Aug 11, 2016 at 10:28 AM, Riccardo Ferrari 
wrote:

> Hi C* users,
>
> Recently, a couple of my nodes crashed (on different dates). I don't
> have core dumps; however, my JVM crash logs look like this:
> ===
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7f8f608c8e40, pid=6916, tid=140253195458304
> #
> # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build
> 1.8.0_60-b27)
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode
> linux-amd64 compressed oops)
> # Problematic frame:
> # C  [liblz4-java6471621810388748482.so+0x5e40]  LZ4_decompress_fast+0xa0
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> ...
> ---  T H R E A D  ---
>
>
> Current thread (0x7f8f5c7b2d50):  JavaThread
> "CompactionExecutor:11952" daemon [_thread_in_native, id=16219,
> stack(0x7f8f3de0d000,0x7f8f3de4e000)]
> ...
> Stack: [0x7f8f3de0d000,0x7f8f3de4e000],  sp=0x7f8f3de4c0e0,
>  free space=252k
> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> code)
> C  [liblz4-java6471621810388748482.so+0x5e40]  LZ4_decompress_fast+0xa0
>
> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> J 4150  net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast([BLjava/nio/
> ByteBuffer;I[BLjava/nio/ByteBuffer;II)I (0 bytes) @ 0x7f8f791e4723
> [0x7f8f791e4680+0xa3]
> J 19836 C2 
> org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap()V
> (354 bytes) @ 0x7f8f7b714930 [0x7f8f7b714320+0x610]
> J 6662 C2 org.apache.cassandra.db.columniterator.
> AbstractSSTableIterator.<init>(Lorg/apache/cassandra/io/
> sstable/format/SSTableReader;Lorg/apache/cassandra/io/util/
> FileDataInput;Lorg/apache/cassandra/db/DecoratedKey;
> Lorg/apache/cassandra/db/RowIndexEntry;Lorg/apache
> /cassandra/db/filter/ColumnFilter;Z)V (389 bytes) @ 0x7f8f79c1cdb8
> [0x7f8f79c1c500+0x8b8]
> J 22393 C2 org.apache.cassandra.db.SinglePartitionReadCommand.
> queryMemtableAndDiskInternal(Lorg/apache/cassandra/db/
> ColumnFamilyStore;Z)Lorg/apache/cassandra/db/rows/UnfilteredRowIterator;
> (818 bytes) @ 0x7f8f7c1d4364 [0x7f8f7c1d2f40+0x1424]
> J 22166 C1 org.apache.cassandra.db.Keyspace.indexPartition(Lorg/
> apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/
> ColumnFamilyStore;Ljava/util/Set;)V (274 bytes) @ 0x7f8f7beb6304
> [0x7f8f7beb5420+0xee4]
> j  org.apache.cassandra.index.SecondaryIndexBuilder.build()V+46
> j  org.apache.cassandra.db.compaction.CompactionManager$11.run()V+18
> J 22293 C2 java.util.concurrent.ThreadPoolExecutor.runWorker(
> Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V (225 bytes) @
> 0x7f8f7b17727c [0x7f8f7b176da0+0x4dc]
> J 21302 C2 java.lang.Thread.run()V (17 bytes) @ 0x7f8f79fe59f8
> [0x7f8f79fe59a0+0x58]
> v  ~StubRoutines::call_stub
> ...
> VM state:not at safepoint (normal execution)
>
> VM Mutex/Monitor currently owned by a thread: None
>
> Heap:
>  par new generation   total 368640K, used 123009K [0x0006d5e0,
> 0x0006eee0, 0x0006eee0)
>   eden space 327680K,  34% used [0x0006d5e0, 0x0006dcaf35c8,
> 0x0006e9e0)
>   from space 40960K,  27% used [0x0006e9e0, 0x0006ea92cf00,
> 0x0006ec60)
>   to   space 40960K,   0% used [0x0006ec60, 0x0006ec60,
> 0x0006eee0)
>  concurrent mark-sweep generation total 3426304K, used 1288977K
> [0x0006eee0, 0x0007c000, 0x0007c000)
>  Metaspace   used 41685K, capacity 42832K, committed 43156K, reserved
> 1087488K
>   class spaceused 4455K, capacity 4702K, committed 4756K, reserved
> 1048576K
> ...
> OS:DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=12.04
> DISTRIB_CODENAME=precise
> DISTRIB_DESCRIPTION="Ubuntu 12.04.1 LTS"
>
> uname:Linux 3.2.0-35-virtual #55-Ubuntu SMP Wed Dec 5 18:02:05 UTC 2012
> x86_64
> libc:glibc 2.15 NPTL 2.15
> rlimit: STACK 8192k, CORE 0k, NPROC 119708, NOFILE 10, AS infinity
> load average:2.96 1.08 0.60
>
> What am I missing?
> Both crashes seem to happen during compaction while running native
> code (LZ4).
> Both crashes happen when the nodes are doing scheduled repairs (so under
> increased load).
> Machines have 4 vCPUs and 15 GB of RAM (m1.xlarge).
> Any hint?
>
> Best,
>


JVM Crash on 3.0.6

2016-08-11 Thread Riccardo Ferrari
Hi C* users,

Recently, a couple of my nodes crashed (on different dates). I don't have
core dumps; however, my JVM crash logs look like this:
===
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7f8f608c8e40, pid=6916, tid=140253195458304
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build
1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# C  [liblz4-java6471621810388748482.so+0x5e40]  LZ4_decompress_fast+0xa0
#
# Failed to write core dump. Core dumps have been disabled. To enable core
dumping, try "ulimit -c unlimited" before starting Java again
#
...
---  T H R E A D  ---


Current thread (0x7f8f5c7b2d50):  JavaThread "CompactionExecutor:11952"
daemon [_thread_in_native, id=16219,
stack(0x7f8f3de0d000,0x7f8f3de4e000)]
...
Stack: [0x7f8f3de0d000,0x7f8f3de4e000],  sp=0x7f8f3de4c0e0,
 free space=252k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
code)
C  [liblz4-java6471621810388748482.so+0x5e40]  LZ4_decompress_fast+0xa0

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
J 4150
 
net.jpountz.lz4.LZ4JNI.LZ4_decompress_fast([BLjava/nio/ByteBuffer;I[BLjava/nio/ByteBuffer;II)I
(0 bytes) @ 0x7f8f791e4723 [0x7f8f791e4680+0xa3]
J 19836 C2
org.apache.cassandra.io.compress.CompressedRandomAccessReader.reBufferMmap()V
(354 bytes) @ 0x7f8f7b714930 [0x7f8f7b714320+0x610]
J 6662 C2
org.apache.cassandra.db.columniterator.AbstractSSTableIterator.<init>(Lorg/apache/cassandra/io/sstable/format/SSTableReader;Lorg/apache/cassandra/io/util/FileDataInput;Lorg/apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/RowIndexEntry;Lorg/apache
/cassandra/db/filter/ColumnFilter;Z)V (389 bytes) @ 0x7f8f79c1cdb8
[0x7f8f79c1c500+0x8b8]
J 22393 C2
org.apache.cassandra.db.SinglePartitionReadCommand.queryMemtableAndDiskInternal(Lorg/apache/cassandra/db/ColumnFamilyStore;Z)Lorg/apache/cassandra/db/rows/UnfilteredRowIterator;
(818 bytes) @ 0x7f8f7c1d4364 [0x7f8f7c1d2f40+0x1424]
J 22166 C1
org.apache.cassandra.db.Keyspace.indexPartition(Lorg/apache/cassandra/db/DecoratedKey;Lorg/apache/cassandra/db/ColumnFamilyStore;Ljava/util/Set;)V
(274 bytes) @ 0x7f8f7beb6304 [0x7f8f7beb5420+0xee4]
j  org.apache.cassandra.index.SecondaryIndexBuilder.build()V+46
j  org.apache.cassandra.db.compaction.CompactionManager$11.run()V+18
J 22293 C2
java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V
(225 bytes) @ 0x7f8f7b17727c [0x7f8f7b176da0+0x4dc]
J 21302 C2 java.lang.Thread.run()V (17 bytes) @ 0x7f8f79fe59f8
[0x7f8f79fe59a0+0x58]
v  ~StubRoutines::call_stub
...
VM state:not at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread: None

Heap:
 par new generation   total 368640K, used 123009K [0x0006d5e0,
0x0006eee0, 0x0006eee0)
  eden space 327680K,  34% used [0x0006d5e0, 0x0006dcaf35c8,
0x0006e9e0)
  from space 40960K,  27% used [0x0006e9e0, 0x0006ea92cf00,
0x0006ec60)
  to   space 40960K,   0% used [0x0006ec60, 0x0006ec60,
0x0006eee0)
 concurrent mark-sweep generation total 3426304K, used 1288977K
[0x0006eee0, 0x0007c000, 0x0007c000)
 Metaspace   used 41685K, capacity 42832K, committed 43156K, reserved
1087488K
  class spaceused 4455K, capacity 4702K, committed 4756K, reserved
1048576K
...
OS:DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=12.04
DISTRIB_CODENAME=precise
DISTRIB_DESCRIPTION="Ubuntu 12.04.1 LTS"

uname:Linux 3.2.0-35-virtual #55-Ubuntu SMP Wed Dec 5 18:02:05 UTC 2012
x86_64
libc:glibc 2.15 NPTL 2.15
rlimit: STACK 8192k, CORE 0k, NPROC 119708, NOFILE 10, AS infinity
load average:2.96 1.08 0.60

What am I missing?
Both crashes seem to happen during compaction while running native code
(LZ4).
Both crashes happen when the nodes are doing scheduled repairs (so under
increased load).
Machines have 4 vCPUs and 15 GB of RAM (m1.xlarge).
Any hint?

Best,