Re: 3.11.2 memory leak

2018-07-22 Thread kurt greaves
Likely in the next few weeks.

On Mon., 23 Jul. 2018, 01:17 Abdul Patel,  wrote:

> Any idea when 3.11.3 is coming out?
>
> On Tuesday, June 19, 2018, kurt greaves  wrote:
>
>> At this point I'd wait for 3.11.3. If you can't, you can get away with
>> backporting a few repair fixes or just doing sub range repairs on 3.11.2
>>
>> On Wed., 20 Jun. 2018, 01:10 Abdul Patel,  wrote:
>>
>>> Hi All,
>>>
>> Do we know what the stable version is for now if you wish to upgrade?
>>>
>>> On Tuesday, June 5, 2018, Steinmaurer, Thomas <
>>> thomas.steinmau...@dynatrace.com> wrote:
>>>
 Jeff,



 FWIW, when talking about
 https://issues.apache.org/jira/browse/CASSANDRA-13929, a patch has been
 available since March without getting any further attention.



 Regards,

 Thomas



 *From:* Jeff Jirsa [mailto:jji...@gmail.com]
 *Sent:* Dienstag, 05. Juni 2018 00:51
 *To:* cassandra 
 *Subject:* Re: 3.11.2 memory leak



 There have been a few people who have reported it, but nobody (yet) has
 offered a patch to fix it. It would be good to have a reliable way to
 repro, and/or an analysis of a heap dump demonstrating the problem (what's
 actually retained at the time you're OOM'ing).



 On Mon, Jun 4, 2018 at 6:52 AM, Abdul Patel 
 wrote:

 Hi All,



 I recently upgraded my non-prod cluster from 3.10 to 3.11.2.

 It was working fine for about 1.5 weeks, then suddenly nodetool info started
 reporting 80% and higher memory consumption.

 Initially it was configured with 16 GB, then I bumped it to 20 GB and rebooted
 all 4 nodes of the cluster (single DC).

 Now after 8 days I again see 80%+ usage, i.e. 16 GB and above, which
 we never saw before.

 Seems like a memory leak bug?

 Does anyone have any idea? Our 3.11.2 release rollout has been halted
 because of this.

 If not 3.11.2, what's the next best stable release we have now?



>>>


Re: which driver to use with cassandra 3

2018-07-22 Thread Nate McCall
Due to how Spring Data binding works, you have to write queries
explicitly in the "...FROM keyspace.table ..." form, either in the
template-method classes (CqlTemplate, etc.) or via @Query annotations,
to avoid the 'use keyspace' overhead. For example, a Repository
implementation for a User class (do this all by hand; do not use
CassandraRepository, per Patrick's point about it being a traveling
carnival of Cassandra anti-patterns) would look something like:

@Query("select id, email, name from myusers.user where id = ?0")
User findById(UUID id);
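
For illustration only, a fuller sketch of such a hand-written repository,
using the plain Repository marker interface rather than CassandraRepository
(entity and column names are hypothetical, and the exact annotation package
depends on your spring-data-cassandra version):

    import java.util.UUID;
    import org.springframework.data.cassandra.repository.Query;
    import org.springframework.data.repository.Repository;

    // The User entity (id, email, name) is assumed to be defined elsewhere.
    public interface UserRepository extends Repository<User, UUID> {

        // Keyspace-qualified CQL avoids the 'use keyspace' overhead described above.
        @Query("select id, email, name from myusers.user where id = ?0")
        User findById(UUID id);
    }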

Another important note - only the template-method classes that work
*directly* with prepared statements use them. In other words: *nothing
else in the API uses prepared statements.* And this is a massive
performance hit in statement parsing alone. There are open issues for
this in the Spring Data JIRA:
https://jira.spring.io/browse/DATACASS-578
https://jira.spring.io/browse/DATACASS-510

If you stick to the cqlTemplate methods for working with
PreparedStatements and ResultSet extractors, etc, because you want
Spring to manage all the configuration, that's totally legit and it
will work well.
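
A minimal sketch of that style, going through the prepared-statement
overloads of the template methods so the statement actually gets prepared
(method signatures quoted from memory of the CqlOperations API - verify them
against your spring-data-cassandra version; the User class is assumed):

    import java.util.List;
    import java.util.UUID;
    import org.springframework.data.cassandra.core.cql.CqlTemplate;

    public class UserDao {
        private final CqlTemplate cqlTemplate;

        public UserDao(CqlTemplate cqlTemplate) {
            this.cqlTemplate = cqlTemplate;
        }

        public List<User> findById(UUID id) {
            // PreparedStatementCreator + binder + RowMapper: this path uses a real
            // prepared statement, unlike the plain String-CQL convenience methods.
            return cqlTemplate.query(
                session -> session.prepare(
                    "select id, email, name from myusers.user where id = ?"),
                ps -> ps.bind(id),
                (row, rowNum) -> new User(row.getUUID("id"),
                                          row.getString("email"),
                                          row.getString("name")));
        }
    }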

In general, this will be a good API one day, as some of the fluent
stuff for working with paged result sets is particularly excellent
and well crafted around modern Java paradigms (outside of not using
PreparedStatement, unfortunately).

On Sun, Jul 22, 2018 at 1:15 PM, Goutham reddy
 wrote:
> Hi,
> Consider overriding the default Java driver provided by Spring Boot if you are
> using DataStax clusters with any of the 3.x DataStax drivers. I agree with
> Patrick: always have one keyspace specified per application; that way
> you achieve domain-driven applications and avoid the overhead of
> switching between keyspaces.
>
> Cheers,
> Goutham
>
> On Fri, Jul 20, 2018 at 10:10 AM Patrick McFadin  wrote:
>>
>> Vitaliy,
>>
>> The DataStax Java driver is very actively maintained by a good size team
>> and a lot of great community contributors. It's version 3.x compatible and
>> even has some 4.x features starting to creep in. Support for virtual tables
>> (https://issues.apache.org/jira/browse/CASSANDRA-7622)  was just merged as
>> an example. Even the largest DataStax customers have a mix of enterprise +
>> OSS and we want to support them either way. Giving developers the most
>> consistent experience is part of that goal.
>>
>> As for spring-data-cassandra, it does pull the latest driver as a part of
>> its own build, so you will already have it in your classpath. Spring adds
>> some auto-magic that you should be aware of. The part you mentioned about
>> schema management is one to be careful with. If you use it in dev,
>> it's not a huge problem. If it gets out to prod, you could potentially have
>> A LOT of concurrent schema changes happening, which can lead to bad things.
>> Also, some of the Spring API features such as findAll() can expose typical
>> C* anti-patterns such as "allow filtering". Just be aware of what feature
>> does what. And finally, another potential production problem is that if you
>> use a lot of keyspaces, Spring will instantiate a new driver Session object
>> per keyspace, which can lead to a lot of redundant connections to the
>> database. From the driver, a better way is to specify a keyspace per query.
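
A bare-bones sketch of that "one shared Session, keyspace per query" pattern
with the 3.x DataStax Java driver (contact point, keyspace and table names are
just placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public final class OneSessionManyKeyspaces {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
            Session session = cluster.connect();   // no default keyspace: one shared Session

            // The keyspace lives in the statement, so this single Session can serve
            // every keyspace without USE switches or per-keyspace Session objects.
            PreparedStatement byId =
                session.prepare("SELECT id, email FROM ks_a.users WHERE id = ?");
            System.out.println(session.execute(byId.bind(java.util.UUID.randomUUID())).one());

            cluster.close();
        }
    }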
>>
>> As you are using spring-data-cassandra, please share your experiences if
>> you can. There are a lot of developers that would benefit from some
>> real-world stories.
>>
>> Patrick
>>
>>
>> On Fri, Jul 20, 2018 at 4:54 AM Vitaliy Semochkin 
>> wrote:
>>>
>>> Thank you very much Duy Hai Doan!
>>> I have relatively simple demands, and since Spring uses the DataStax
>>> driver I can always fall back to it,
>>> though I would prefer to use Spring in order to do the bootstrapping and
>>> resource management for me.
>>> On Fri, Jul 20, 2018 at 4:51 PM DuyHai Doan  wrote:
>>> >
>>> > Spring Data Cassandra is so-so... It has fewer features (at least at the
>>> > time I looked at it) than the default Java driver.
>>> >
>>> > For the driver, right now most people are using DataStax's.
>>> >
>>> > On Fri, Jul 20, 2018 at 3:36 PM, Vitaliy Semochkin
>>> >  wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> Which driver should be used with Cassandra 3:
>>> >>
>>> >> the one provided by DataStax, Netflix, or something else?
>>> >>
>>> >> Spring uses the driver from DataStax, but is it a reliable solution for
>>> >> a long-term project, bearing in mind that DataStax and Cassandra
>>> >> parted ways?
>>> >>
>>> >> Regards,
>>> >> Vitaliy
>>> >>
>>> >
>>>

IllegalArgumentException while saving rdd after repartition by cassandra replica set using cassandra spark connector

2018-07-22 Thread M Singh
Hi Folks:

I am working on a project to save a Spark dataframe to Cassandra and am
getting an exception about the row size not being valid (see below). I tried to
trace the code in the connector and it appears that the row size (3 below) is
different from the column count (which turns out to be 1). I am trying to follow
the example from
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md
with my customer having two more fields than just the id mentioned in the
example. In the case of the example I think it works because it has only 1
column (customer_id), but I need to save additional fields. I've tried
searching but have not found any resolution to this.
I am using Spark 2.3.0 and spark-cassandra-connector:2.3.0-s_2.11.
Just a little background - I tried saving the dataframe to Cassandra and it
works. However, it is very slow. So I am trying to see if using
repartitionByCassandraReplica will make it faster. I've tried various
combinations of batch row size, concurrent writers, etc. on the dataframe and
it is still very slow. Therefore I am looking at using
repartitionByCassandraReplica before trying to save it to the Cassandra table.
If there are other options to make saving the dataframe to Cassandra faster,
please let me know.
Here is my scenario:

Cassandra table in keyspace test:

  create table customer ( customer_id text primary key, order integer, value integer);

Spark shell commands:

  import com.datastax.spark.connector._
  import org.apache.spark.sql.cassandra._
  case class Customer(customer_id:String, order_id:Int, value:Int)
  val customers = Seq(Customer("1",1,1),Customer("2",2,2)).toDF("customer_id","order_id","value")
  val customersRdd = customers.rdd.repartitionByCassandraReplica("test","customers")
  customersRdd.saveToCassandra("test","customer")

At this point I get an exception:

  java.lang.IllegalArgumentException: requirement failed: Invalid row size: 3 instead of 1.
    at scala.Predef$.require(Predef.scala:224)
    at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:23)
    at com.datastax.spark.connector.writer.SqlRowWriter.readColumnValues(SqlRowWriter.scala:12)
    at com.datastax.spark.connector.writer.BoundStatementBuilder.bind(BoundStatementBuilder.scala:99)
    at com.datastax.spark.connector.rdd.partitioner.TokenGenerator.getPartitionKeyBufferFor(TokenGenerator.scala:38)
    at com.datastax.spark.connector.rdd.partitioner.ReplicaPartitioner.getPartition(ReplicaPartitioner.scala:70)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

  18/07/18 10:27:51 ERROR Executor: Exception in task 1.0 in stage 6.0 (TID 4)
  java.lang.IllegalArgumentException: requirement failed: Invalid row size: 3 instead of 1.
Thanks for your help.


Re: IllegalArgumentException while saving RDD to cassandra using Spark Cassandra Connector

2018-07-22 Thread M Singh
Hi Folks:

Just checking if anyone has any pointers for the Cassandra Spark connector
issue I've mentioned:

IllegalArgumentException on executing save after repartition by Cassandra replica:

  val customersRdd = customers.rdd.repartitionByCassandraReplica("test","customers")
  customersRdd.saveToCassandra("test","customer")

Thanks

Re: Stumped By Cassandra delays

2018-07-22 Thread Gareth Collins
Hi Shalom,

Thanks very much for the response!

We are only using batches on one Cassandra partition, to improve
performance. Batches are NEVER used in this app across Cassandra partitions.
And if you look at the trace
messages I showed, there is only one statement per batch anyway.

In fact, what I see in the trace is that the responses to the writes may be
being held up by the reads. Here is a more complete example which is
consistent across nodes. We are using the DataStax client 3.1.2. Note that all
the requests appear to be processed on nio-worker-5, which suggests that this
may all be on the one connection (even though I can see two connections to
each C* server from each client):



2018-07-20 05:32:43,185 [luster1-nio-worker-5] (core.QueryLogger.SLOW) DEBUG - [cluster1]
[/10.123.4.52:9042] Query too slow, took 9322 ms: [2 bound values]
select a, b, c, d from  where token(a)>? and token(a)<=?;   << slow read

2018-07-20 05:32:43,185 [luster1-nio-worker-5] (core.QueryLogger.SLOW) DEBUG - [cluster1]
[/10.123.4.52:9042] Query too slow, took 5950 ms: [1 statements, 6 bound values]
BEGIN BATCH INSERT INTO  (a, b, c, d, e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH;   << write response received immediately after the read

2018-07-20 05:32:43,185 [luster1-nio-worker-5] (core.QueryLogger.SLOW) DEBUG - [cluster1]
[/10.123.4.52:9042] Query too slow, took 511 ms: [1 statements, 6 bound values]
BEGIN BATCH INSERT INTO  (a, b, c, d, e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH;   << write response received immediately after the read

2018-07-20 05:32:43,607 [luster1-nio-worker-5] (core.QueryLogger.NORMAL) DEBUG - [cluster1]
[/10.123.4.52:9042] Query completed normally, took 33 ms: [2 bound values]
select CustomerID, ds_, data_, AudienceList from data.customer_b01be157931bcbfa32b7f240a638129d where token(CustomerID)>? and token(CustomerID)<=?;   << normal read

2018-07-20 05:32:45,938 [luster1-nio-worker-5] (core.QueryLogger.SLOW) DEBUG - [cluster1]
[/10.123.4.52:9042] Query too slow, took 1701 ms: [2 bound values]
select a, b, c, d from  where token(a)>? and token(a)<=?;   << slow read

2018-07-20 05:32:46,257 [luster1-nio-worker-5] (core.QueryLogger.NORMAL) DEBUG - [cluster1]
[/10.123.4.52:9042] Query completed normally, took 0 ms: [1 statements, 6 bound values]
BEGIN BATCH INSERT INTO  (a, b, c, d, e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH;   << normal write - no overlap with the read

2018-07-20 05:32:46,336 [luster1-nio-worker-5] (core.QueryLogger.NORMAL) DEBUG - [cluster1]
[/10.123.4.52:9042] Query completed normally, took 30 ms: [2 bound values]
select a, b, c, d from  where token(a)>? and token(a)<=?;   << normal read

2018-07-20 05:32:48,622 [luster1-nio-worker-5] (core.QueryLogger.SLOW) DEBUG - [cluster1]
[/10.123.4.52:9042] Query too slow, took 1626 ms: [2 bound values]
select a, b, c, d from  where token(a)>? and token(a)<=?;   << slow read

2018-07-20 05:32:48,622 [luster1-nio-worker-5] (core.QueryLogger.SLOW) DEBUG - [cluster1]
[/10.123.4.52:9042] Query too slow, took 425 ms: [1 statements, 6 bound values]
BEGIN BATCH INSERT INTO  (a, b, c, d, e) VALUES (?, ?, ?, ?, ?) using ttl ?; APPLY BATCH;   << write appears immediately after the read

I would suggest some sort of bug on the client holding up the
thread... but I don't know why I would only have a problem on one C* node at
any one time (the clients process reads and writes to other nodes at the
same time without delays).
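
If everything really is funnelling through one connection, one thing worth
experimenting with is the per-host pooling configuration in the 3.1.x Java
driver; a rough sketch (the numbers are illustrative and need tuning for your
own workload):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.HostDistance;
    import com.datastax.driver.core.PoolingOptions;

    public final class PoolingTuning {
        public static void main(String[] args) {
            PoolingOptions pooling = new PoolingOptions()
                .setCoreConnectionsPerHost(HostDistance.LOCAL, 2)       // keep >1 connection per host open
                .setMaxConnectionsPerHost(HostDistance.LOCAL, 4)
                .setMaxRequestsPerConnection(HostDistance.LOCAL, 1024); // in-flight requests per connection

            Cluster cluster = Cluster.builder()
                .addContactPoint("10.123.4.52")
                .withPoolingOptions(pooling)
                .build();
            // ...build the Session and run the normal workload here...
            cluster.close();
        }
    }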

thanks in advance,
Gareth


On Sun, Jul 22, 2018 at 4:12 AM, shalom sagges 
wrote:

> Hi Gareth,
>
> If you're using batches for multiple partitions, this may be the root
> cause you've been looking for.
>
> https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/
>
> If batches are optimally used and only one node is misbehaving, check if
> NTP on the node is properly synced.
>
> Hope this helps!
>
>
> On Sat, Jul 21, 2018 at 9:31 PM, Gareth Collins <
> gareth.o.coll...@gmail.com> wrote:
>
>> Hello,
>>
>> We are running Cassandra 2.1.14 in AWS, with c5.4xlarge machines
>> (initially these were m4.xlarge) for our cassandra servers and
>> m4.xlarge for our application servers. On one of our clusters having
>> problems we have 6 C* nodes and 6 AS nodes (two nodes for C*/AS in
>> each availability zone).
>>
>> In the deployed application it seems to be a common use-case to one of
>> the 

Re: 3.11.2 memory leak

2018-07-22 Thread Abdul Patel
Any idea when 3.11.3 is coming out?

On Tuesday, June 19, 2018, kurt greaves  wrote:

> At this point I'd wait for 3.11.3. If you can't, you can get away with
> backporting a few repair fixes or just doing sub range repairs on 3.11.2
>
> On Wed., 20 Jun. 2018, 01:10 Abdul Patel,  wrote:
>
>> Hi All,
>>
>> Do we know what the stable version is for now if you wish to upgrade?
>>
>> On Tuesday, June 5, 2018, Steinmaurer, Thomas <
>> thomas.steinmau...@dynatrace.com> wrote:
>>
>>> Jeff,
>>>
>>>
>>>
>>> FWIW, when talking about
>>> https://issues.apache.org/jira/browse/CASSANDRA-13929, a patch has been
>>> available since March without getting any further attention.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Thomas
>>>
>>>
>>>
>>> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
>>> *Sent:* Dienstag, 05. Juni 2018 00:51
>>> *To:* cassandra 
>>> *Subject:* Re: 3.11.2 memory leak
>>>
>>>
>>>
>>> There have been a few people who have reported it, but nobody (yet) has
>>> offered a patch to fix it. It would be good to have a reliable way to
>>> repro, and/or an analysis of a heap dump demonstrating the problem (what's
>>> actually retained at the time you're OOM'ing).
>>>
>>>
>>>
>>> On Mon, Jun 4, 2018 at 6:52 AM, Abdul Patel  wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> I recently upgraded my non-prod cluster from 3.10 to 3.11.2.
>>>
>>> It was working fine for about 1.5 weeks, then suddenly nodetool info started
>>> reporting 80% and higher memory consumption.
>>>
>>> Initially it was configured with 16 GB, then I bumped it to 20 GB and rebooted
>>> all 4 nodes of the cluster (single DC).
>>>
>>> Now after 8 days I again see 80%+ usage, i.e. 16 GB and above, which
>>> we never saw before.
>>>
>>> Seems like a memory leak bug?
>>>
>>> Does anyone have any idea? Our 3.11.2 release rollout has been halted
>>> because of this.
>>>
>>> If not 3.11.2, what's the next best stable release we have now?
>>>
>>>
>>>
>>


Fwd: Re: Cassandra crashed with no log

2018-07-22 Thread onmstester onmstester
Thanks Jeff. At the time of the crash it said:

  .../linux-4.4.0/mm/pgtable-generic.c:33: bad pmd

So I just ran this on all of my nodes:

  echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag

---- Forwarded message ----
From: Jeff Jirsa
Date: Sun, 22 Jul 2018 10:43:38 +0430
Subject: Re: Cassandra crashed with no log

Anything in non-Cassandra logs? Dmesg?

--
Jeff Jirsa

On Jul 21, 2018, at 11:07 PM, onmstester onmstester wrote:

Cassandra on one of my nodes crashed without any error/warning in the
system/gc/debug log. All JMX metrics are being monitored; the last fetched
values were 50% heap usage and 20% CPU usage. How can I find the cause of
the crash?

JMX metric to report number failed WCL ALL

2018-07-22 Thread onmstester onmstester
I'm using RF=2 and write consistency = ONE. Is there a counter in Cassandra JMX
to report the number of writes that were only acknowledged by one node (instead
of both replicas)? Although I don't require all replicas to acknowledge the
write, I do consider that the normal state of the cluster.

Re: Stumped By Cassandra delays

2018-07-22 Thread shalom sagges
Hi Gareth,

If you're using batches for multiple partitions, this may be the root cause
you've been looking for.

https://inoio.de/blog/2016/01/13/cassandra-to-batch-or-not-to-batch/

If batches are optimally used and only one node is misbehaving, check if
NTP on the node is properly synced.

Hope this helps!


On Sat, Jul 21, 2018 at 9:31 PM, Gareth Collins 
wrote:

> Hello,
>
> We are running Cassandra 2.1.14 in AWS, with c5.4xlarge machines
> (initially these were m4.xlarge) for our cassandra servers and
> m4.xlarge for our application servers. On one of our clusters having
> problems we have 6 C* nodes and 6 AS nodes (two nodes for C*/AS in
> each availability zone).
>
> In the deployed application it seems to be a common use case to do one of
> the following. These use cases are having periodic errors:
> (1) Copy one Cassandra table to another table using the application server.
> (2) Export from a Cassandra table to file using the application server.
>
> The application server is reading from the table via token range, the
> token range queries being calculated to ensure the whole token range
> for a query falls on the same node. i.e. the query looks like this:
>
> select * from  where token(key) > ? and token(key) <= ?
>
> This was probably initially done on the assumption that the driver
> would be able to figure out which nodes contained the data. As we
> realize now, the driver only supports routing to the right node if the
> partition key is defined in the WHERE clause.
>
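
One way to keep the token-range scan but still get locality with the 3.x Java
driver is to drive the ranges from the cluster metadata and hint the routing
per range. A rough sketch, assuming Murmur3 tokens and a driver version that
has Statement#setRoutingToken (keyspace/table/column names are placeholders):

    import com.datastax.driver.core.*;

    public final class TokenRangeScan {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
            Session session = cluster.connect();
            PreparedStatement ps = session.prepare(
                "SELECT * FROM ks.tbl WHERE token(key) > ? AND token(key) <= ?");

            for (TokenRange range : cluster.getMetadata().getTokenRanges()) {
                for (TokenRange r : range.unwrap()) {               // split wrap-around ranges
                    BoundStatement bs = ps.bind(r.getStart().getValue(),
                                                r.getEnd().getValue());
                    bs.setRoutingToken(r.getEnd());                 // lets TokenAwarePolicy pick a replica
                    session.execute(bs);                            // ...consume rows here...
                }
            }
            cluster.close();
        }
    }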
> When we do the read we are doing a lot of queries in parallel to
> maximize performance. I believe when the copy is being run there are
> currently 5 threads per machine doing the copy for a max of 30
> concurrent read requests across the cluster.
>
> Specifically these tasks been periodically having a few of these errors:
>
> INFO  [ScheduledTasks:1] 2018-07-13 20:03:20,124
> MessagingService.java:929 - REQUEST_RESPONSE messages were dropped in
> last 5000 ms: 1 for internal timeout and 0 for cross node timeout
>
> Which are causing errors in the read by token range queries.
>
> Running "nodetool settraceprobability 1" and running the test when
> failing we could see that this timeout would occur when using a
> coordinator on the read query (i.e. the co-ordinator sent the message
> but didn't get a response to the query from the other node within the
> time limit). We were seeing these timeouts periodically even if we set
> the timeouts to 60 seconds.
>
> As I mentioned at the beginning we had initially been using m4.xlarge
> for our Cassandra servers. After discussion with AWS it was suggested
> that we could be hitting performance limits (i.e. either network or
> disk - I believe more likely network as I didn't see the disk getting
> hit very hard) so we upgraded the Cassandra servers and everything was
> fine for a while.
>
> But then the problems started to re-occur recently...pretty
> consistently failing on these copy or export jobs running overnight.
> Having looked at resource usage statistics graphs it appeared that the
> C* servers were not heavily loaded at all (the app servers were being
> maxed out) and I did not see any significant garbage collections in
> the logs that could explain the delays.
>
> As a last resort I decided to turn up the logging on the server and
> client, datastax client set to debug and server set to the following
> logs via nodetool...the goal being to maximize logging while cutting
> out the very verbose stuff (e.g. Message.java appears to print out the
> whole message in 2.1.14 when put into debug -> it looks like that was
> moved to trace in a later 2.1.x release):
>
> bin/nodetool setlogginglevel org.apache.cassandra.tracing.Tracing INFO
> bin/nodetool setlogginglevel org.apache.cassandra.transport.Message INFO
> bin/nodetool setlogginglevel org.apache.cassandra.db.ColumnFamilyStore
> DEBUG
> bin/nodetool setlogginglevel org.apache.cassandra.gms.Gossiper DEBUG
> bin/nodetool setlogginglevel
> org.apache.cassandra.db.filter.SliceQueryFilter DEBUG
> bin/nodetool setlogginglevel
> org.apache.cassandra.service.pager.AbstractQueryPager INFO
> bin/nodetool setlogginglevel org.apache.cassandra TRACE
>
> Of course when we did this (as part of turning on the logging the
> application servers were restarted) the problematic export to file
> jobs which had failed every time for the last week succeeded and ran
> much faster than they usually had (47 minutes vs 1 1/2 hours), so I
> decided to look for the biggest delay (which turned out to be ~9
> seconds) and see what I could find in the log - outside of this time,
> the response times were up to perhaps 20ms. Here is what I found:
>
> (1) Only one Cassandra node had delays at a time.
>
> (2) On the Cassandra node that did had delays there was no significant
> information from the GCInspector (the system stopped processing client
> requests between 05:32:33 - 05:32:43). If anything it confirmed my
> belief that the system was lightly loaded
>
> 

Re: Cassandra crashed with no log

2018-07-22 Thread Jeff Jirsa
Anything in non-Cassandra logs? Dmesg?

-- 
Jeff Jirsa


> On Jul 21, 2018, at 11:07 PM, onmstester onmstester  
> wrote:
> 
> Cassandra on one of my nodes crashed without any error/warning in the
> system/gc/debug log. All JMX metrics are being monitored; the last fetched
> values were 50% for heap usage and 20% for CPU usage. How can I find the
> cause of the crash?
> 
> 
> 
> 


Cassandra crashed with no log

2018-07-22 Thread onmstester onmstester
Cassandra on one of my nodes crashed without any error/warning in the
system/gc/debug log. All JMX metrics are being monitored; the last fetched
values were 50% for heap usage and 20% for CPU usage. How can I find the
cause of the crash?