Re: Cassandra process exiting mysteriously

2014-08-12 Thread Or Sher
Clint, did you find anything?
I just noticed it happens to us too, on only one node in our CI cluster.
I don't think there is any special usage before it happens... The last line
in the log before the shutdown lines is at least an hour earlier.
We're using C* 2.0.9.


On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi Rob,

 Thanks for the clarification; this is really useful.  I'll run some
 experiments to see if the problem is a JVM OOM on our build machine.

 Best regards,
 Clint

 On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:
  On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com
 wrote:
 
  On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com
  wrote:
 
  this doesn't look like an OOM to me.  If the kernel OOM kills Cassandra
  then Cassandra instantly vaporizes, and there will be nothing in the
  Cassandra logs (you will find information about the OOM in the system
 logs
  though, eg in dmesg).  In the log snippet above you see an orderly
 shutdown,
  this is completely different to the instant OOM kill.
 
 
  Not really.
 
  https://issues.apache.org/jira/browse/CASSANDRA-7507
 
 
  To be clear, there are two different OOMs here; I am talking about the JVM
  OOM, not the system level. As CASSANDRA-7507 indicates, JVM OOM does not
  necessarily result in the cassandra process dying, and can in fact trigger
  clean shutdown.
 
  System level OOM will in fact send the equivalent of KILL, which will not
  trigger the clean shutdown hook in Cassandra.
 
  =Rob




-- 
Or Sher


Cassandra schema disagreement

2014-08-12 Thread Demeyer Jonathan
Hello,

I have a cluster running and I'm trying to change the schema on it. Although it
succeeds on one cluster (a test one), on another it keeps creating two separate
schema versions (both are 2-DC configurations; the cluster where it goes wrong
ends up with a different schema version in each DC).

I use apache-cassandra11-1.1.12 on CentOS 6.4

I'm trying to start from a fresh cassandra config (doing "rm -rf
/var/lib/cassandra/{commitlog,data}/*" while cassandra is stopped).

Each DC is on a separate IP segment, but there is no firewall between them.

Here is the output of the command when the desynchronisation occurs:
---
[root@cassandranode00 CDN]# cassandra-cli -f reCreateCassandraStruct.sh
Connected to: TTF Cluster v2013_1257 on 127.0.0.1/9160
7ef8c681-189a-3088-8598-560437f705d9
Waiting for schema agreement...
... schemas agree across the cluster
Authenticated to keyspace: ks1
f179fd8e-f8ca-36cf-bf53-d8341fd6006e
Waiting for schema agreement...
The schema has not settled in 10 seconds; further migrations are ill-advised 
until it does.
Versions are f179fd8e-f8ca-36cf-bf53-d8341fd6006e:[10.69.221.20, 10.69.221.21, 10.69.221.22],
e9656b30-b671-3fce-9fb4-bdd3e6da36d1:[10.69.10.14, 10.69.10.13, 10.69.10.11]
---

I also tried creating a keyspace with a column family using OpsCenter (with
no better result).

I'm out of hints as to where to look. Do you have any suggestions?

Are there improvements in this area with Cassandra versions later than 1.1.12?

Thanks,
Jonathan DEMEYER
Here is the start of reCreateCassandraStruct.sh :
CREATE KEYSPACE ks1 WITH placement_strategy = 'NetworkTopologyStrategy' AND 
strategy_options={DC1:3,DC2:3};
use ks1;
create column family id
with comparator = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and column_metadata = [
{
column_name : 'user',
validation_class : UTF8Type
}
];
CREATE KEYSPACE ks2 WITH placement_strategy = 'NetworkTopologyStrategy' AND 
strategy_options={DC1:3,DC2:3};
use ks2;
create column family id;


Cassandra corrupt column family

2014-08-12 Thread Batranut Bogdan
Hello all,

I have altered a table in Cassandra and on one node it somehow got corrupted;
the changes did not propagate OK. Ran repair keyspace columnfamily... nothing
changed...

Is there a way to repair this?

Replacing a dead node in Cassandra 2.0.8

2014-08-12 Thread tsi
In the datastax documentation there is a description how to replace a dead
node
(http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html).
Is the  replace_address option required even if the IP address of the new
node is the same as the original one (I read a note about the auto
bootstrapping being stored somewhere in the system tables)?



--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replacing-a-dead-node-in-Cassandra-2-0-8-tp7596245.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


RE: Cassandra schema disagreement

2014-08-12 Thread Demeyer Jonathan
After a lot of investigation, it seems that the clocks were desynchronized
across the cluster (although we did not check that resyncing them resolves the
problem; instead we modified the schema with only one node up and restarted all
the other nodes afterwards).
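
In case it is useful to someone else, this is roughly what we ended up checking
on each node (a sketch; the NTP server is a placeholder):

# clock skew: query an NTP server without stepping the clock
ntpdate -q pool.ntp.org

# schema versions as seen by the cluster (cassandra-cli on 1.1.x; the same
# command also works from the interactive prompt)
echo 'describe cluster;' | cassandra-cli -h 127.0.0.1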


From: Demeyer Jonathan [mailto:jonathan.deme...@macq.eu]
Sent: Tuesday, August 12, 2014 11:03
To: user@cassandra.apache.org
Subject: Cassandra schema disagreement

Hello,

I have a cluster running and I'm trying to change the schema on it. Although it
succeeds on one cluster (a test one), on another it keeps creating two separate
schema versions (both are 2-DC configurations; the cluster where it goes wrong
ends up with a different schema version in each DC).

I use apache-cassandra11-1.1.12 on CentOS 6.4

I'm trying to start from a fresh cassandra config (doing "rm -rf
/var/lib/cassandra/{commitlog,data}/*" while cassandra is stopped).

Each DC is on a separate IP segment, but there is no firewall between them.

Here is the output of the command when the desynchronisation occurs:
---
[root@cassandranode00 CDN]# cassandra-cli -f reCreateCassandraStruct.sh
Connected to: TTF Cluster v2013_1257 on 127.0.0.1/9160
7ef8c681-189a-3088-8598-560437f705d9
Waiting for schema agreement...
... schemas agree across the cluster
Authenticated to keyspace: ks1
f179fd8e-f8ca-36cf-bf53-d8341fd6006e
Waiting for schema agreement...
The schema has not settled in 10 seconds; further migrations are ill-advised 
until it does.
Versions are f179fd8e-f8ca-36cf-bf53-d8341fd6006e:[10.69.221.20, 10.69.221.21, 10.69.221.22],
e9656b30-b671-3fce-9fb4-bdd3e6da36d1:[10.69.10.14, 10.69.10.13, 10.69.10.11]
---

I also tried creating a keyspace with a column family using OpsCenter (with
no better result).

I'm out of hints as to where to look. Do you have any suggestions?

Are there improvements in this area with Cassandra versions later than 1.1.12?

Thanks,
Jonathan DEMEYER
Here is the start of reCreateCassandraStruct.sh :
CREATE KEYSPACE ks1 WITH placement_strategy = 'NetworkTopologyStrategy' AND 
strategy_options={DC1:3,DC2:3};
use ks1;
create column family id
with comparator = 'UTF8Type'
and key_validation_class = 'UTF8Type'
and column_metadata = [
{
column_name : 'user',
validation_class : UTF8Type
}
];
CREATE KEYSPACE ks2 WITH placement_strategy = 'NetworkTopologyStrategy' AND 
strategy_options={DC1:3,DC2:3};
use ks2;
create column family id;


Re: Cassandra corrupt column family

2014-08-12 Thread Mark Reddy
Hi,

Without more information (Cassandra version, setup, topology, schema,
queries performed) this list won't be able to assist you. If you can
provide a more detailed explanation of the steps you took to reach your
current state that would be great.


Mark


On Tue, Aug 12, 2014 at 12:21 PM, Batranut Bogdan batra...@yahoo.com
wrote:

 Hello all,

  I have altered a table in Cassandra and on one node it somehow got
  corrupted; the changes did not propagate OK. Ran repair keyspace
  columnfamily... nothing changed...

 Is there a way to repair this?



Re: clarification on 100k tombstone limit in indexes

2014-08-12 Thread DuyHai Doan
Hello Ian

So that way each index entry *will* have quite a few entries and the index
as a whole won't grow too big.  Is my thinking correct here? -- In this
case yes. Do not forget that for each date value, there will be 1
corresponding index value + 10 updates. If you have an approximate count
for those "quite a few entries", some quick math should give you an idea of how
large the index partition is.

I had considered an approach like this but my concern is that for any
given minute *all* of the updates will be handled by a single node, right?
-- If your time resolution is a minute, yes, it will be a problem. And
depending on the insert rate, it can quickly become a bottleneck during
this minute.

 The manual index approach suffers a lot from this bottleneck issue under heavy
workloads, which is the main reason they implemented a distributed secondary
index. There is no free lunch though. What you gain in terms of control and
tuning with the manual index, you lose on the load distribution side.




On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi DuyHai,

 Thanks for the detailed response!  A few responses below:

 On a side node, your usage of secondary index is not the best one.
 Indeed, indexing the update date will lead to a situation where for one
 date, you'll mostly have one or a few matching items (assuming that the
 update date resolution is small enough and update rate is not intense).
  -- I should have mentioned this originally (slipped my mind) but to deal
 specifically with this problem I had planned to use a timestamp with a
 resolution of 1 minute (like your minute_bucket).  So that way each index
 entry *will* have quite a few entries and the index as a whole won't grow
 too big.  Is my thinking correct here?

  You'd be better off creating a manual reverse-index to track modification date,
 something like this  -- I had considered an approach like this but my
 concern is that for any given minute *all* of the updates will be handled
 by a single node, right?  For example, if the minute_bucket is 2739 then
 for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.

 Cheers,
 Ian




 On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Ian

 It sounds like this 100k limit is, indeed, a global limit as opposed
 to a per-row limit --The threshold applies to each REQUEST, not
 partition or globally.

 The threshold does not apply to a partition (physical row) simply because
 in one request you can fetch data from many partitions (multi get slice).
 There was a JIRA about this here:
 https://issues.apache.org/jira/browse/CASSANDRA-6865

 Are these tombstones ever GCed out of the index? -- Yes they are,
 during compactions of the index column family.

  How frequently? -- That's the real pain. Indeed you do not have any
  control over the tuning of secondary index CF compaction. As far as I know,
  the compaction settings (strategy, min/max thresholds...) inherit from those
  of the base table.

  Now, by looking very quickly at your data model, it seems that you have a
  skinny partition pattern. Since you mentioned that the date is updated only
  10 times max, you should not run into the tombstone threshold issue.

  On a side note, your usage of a secondary index is not the best one.
  Indeed, indexing the update date will lead to a situation where for one
  date, you'll mostly have one or a few matching items (assuming that the
  update date resolution is small enough and the update rate is not intense). It
  is the high-cardinality scenario to be avoided (
  http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
  Plus, the query on the index (find all items where last_updated > [now - 30
  minutes]) makes things worse since it is not an exact match but an inequality.

   You'd be better off creating a manual reverse-index to track the modification
  date, something like this:

 CREATE TABLE last_updated_item (
 minute_bucket int, // format MMDDHHmm
 last_update_date timestamp,
 item_id ascii,
 PRIMARY KEY(minute_bucket, last_update_date)
 );

   The last_update_date column is quite self-explanatory. The minute_bucket is
  trickier. The idea is to split time into 30-minute buckets. 00:00 to 00:30 is
  bucket 1, 00:30 to 01:00 is bucket 2 and so on. For a whole day, you'd have 48
  buckets. We need to put data into buckets to avoid ultra-wide rows since you
  mentioned that there are 10 items (so 10 updates) / sec. Of course, 30 mins is
  just an example; you can tune it down to a window of 5 minutes or 1 minute,
  depending on the insertion rate.





 On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi Mark -

  Thanks for the clarification but as I'm not too familiar with the nuts &
  bolts of Cassandra I'm not sure how to apply that info to my current
 situation.  It sounds like this 

Re: Node bootstrap

2014-08-12 Thread Ruchir Jha
Still having issues with node bootstrapping. The new node just died: it went
into a full GC, and the nodes it had active streams with noticed it was down.
After the full GC finished, the new node printed this log:

ERROR 02:52:36,259 Stream failed because /10.10.20.35 died or was
restarted/removed (streams may still be active in background, but further
streams won't be started)

Here 10.10.20.35 is an existing node that the new node was streaming from. A
similar log was printed for every other node in the cluster. Why did the
new node just exit after the full GC pause?

We have heap dumps enabled on full GCs and these are the top offenders on
the new node. A new entry that I noticed is the CompressionMetadata chunks.
Anything I can do to optimize that?

 num     #instances         #bytes  class name
----------------------------------------------
   1:      42508421     4818885752  [B
   2:      65860543     3161306064  java.nio.HeapByteBuffer
   3:     124361093     2984666232  org.apache.cassandra.io.compress.CompressionMetadata$Chunk
   4:      29745665     1427791920  edu.stanford.ppl.concurrent.SnapTreeMap$Node
   5:      29810362      953931584  org.apache.cassandra.db.Column
   6:         31623      498012768  [Lorg.apache.cassandra.io.compress.CompressionMetadata$Chunk;



On Tue, Aug 5, 2014 at 2:59 PM, Ruchir Jha ruchir@gmail.com wrote:

 Also, right now the top command shows that we are at 500-700% CPU, and
 we have 23 total processors, which means we have a lot of idle CPU left
 over, so throwing more threads at compaction and flush should alleviate the
 problem?


 On Tue, Aug 5, 2014 at 2:57 PM, Ruchir Jha ruchir@gmail.com wrote:


 Right now, we have 6 flush writers and compaction_throughput_mb_per_sec
 is set to 0, which I believe disables throttling.
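
  For reference, both of those live in cassandra.yaml (memtable_flush_writers
  and compaction_throughput_mb_per_sec), and the compaction throttle can also
  be changed on a running node. A sketch, where the path and value are only
  examples:

  # re-enable compaction throttling at runtime, no restart needed (0 = unthrottled)
  nodetool setcompactionthroughput 16

  # confirm what the node is configured with
  grep -E 'memtable_flush_writers|compaction_throughput_mb_per_sec' /etc/cassandra/conf/cassandra.yaml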

 Also, Here is the iostat -x 5 5 output:


 Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sda  10.00  1450.35   50.79   55.92  9775.97 12030.14
 204.34 1.56   14.62   1.05  11.21
 dm-0  0.00 0.003.59   18.82   166.52   150.35
  14.14 0.44   19.49   0.54   1.22
 dm-1  0.00 0.002.325.3718.5642.98
 8.00 0.76   98.82   0.43   0.33
 dm-2  0.00 0.00  162.17 5836.66 32714.46 47040.87
  13.30 5.570.90   0.06  36.00
 sdb   0.40  4251.90  106.72  107.35 23123.61 35204.09
 272.46 4.43   20.68   1.29  27.64

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 14.64   10.751.81   13.500.00   59.29

 Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sda  15.40  1344.60   68.80  145.60  4964.80 11790.40
  78.15 0.381.80   0.80  17.10
 dm-0  0.00 0.00   43.00 1186.20  2292.80  9489.60
 9.59 4.883.90   0.09  11.58
 dm-1  0.00 0.001.600.0012.80 0.00
 8.00 0.03   16.00   2.00   0.32
 dm-2  0.00 0.00  197.20 17583.80 35152.00 140664.00
 9.89  2847.50  109.52   0.05  93.50
 sdb  13.20 16552.20  159.00  742.20 32745.60 129129.60
 179.6272.88   66.01   1.04  93.42

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   15.51   19.771.975.020.00   57.73

 Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sda  16.20   523.40   60.00  285.00  5220.80  5913.60
  32.27 0.250.72   0.60  20.86
 dm-0  0.00 0.000.801.4032.0011.20
  19.64 0.013.18   1.55   0.34
 dm-1  0.00 0.001.600.0012.80 0.00
 8.00 0.03   21.00   2.62   0.42
 dm-2  0.00 0.00  339.40 5886.80 66219.20 47092.80
  18.20   251.66  184.72   0.10  63.48
 sdb   1.00  5025.40  264.20  209.20 60992.00 50422.40
 235.35 5.98   40.92   1.23  58.28

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   16.59   16.342.039.010.00   56.04

 Device: rrqm/s   wrqm/s r/s w/s   rsec/s   wsec/s
 avgrq-sz avgqu-sz   await  svctm  %util
 sda   5.40   320.00   37.40  159.80  2483.20  3529.60
  30.49 0.100.52   0.39   7.76
 dm-0  0.00 0.000.203.60 1.6028.80
 8.00 0.000.68   0.68   0.26
 dm-1  0.00 0.000.000.00 0.00 0.00
 0.00 0.000.00   0.00   0.00
 dm-2  0.00 0.00  287.20 13108.20 53985.60 104864.00
  11.86   869.18   48.82   0.06  76.96
 sdb   5.20 12163.40  238.20  532.00 51235.20 93753.60
 188.2521.46   23.75   0.97  75.08



 On Tue, Aug 5, 2014 at 1:55 PM, Mark Reddy mark.re...@boxever.com
 wrote:

 Hi Ruchir,

  The large number of blocked flushes and the number of pending
  compactions would still indicate IO contention. Can you post the output of
  'iostat -x 5 5'?

 

Re: clarification on 100k tombstone limit in indexes

2014-08-12 Thread Ian Rose
Makes sense - thanks again!


On Tue, Aug 12, 2014 at 9:45 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Hello Ian

 So that way each index entry *will* have quite a few entries and the
 index as a whole won't grow too big.  Is my thinking correct here? -- In
 this case yes. Do not forget that for each date value, there will be 1
  corresponding index value + 10 updates. If you have an approximate count
  for those "quite a few entries", some quick math should give you an idea of
  how large the index partition is.

 I had considered an approach like this but my concern is that for any
 given minute *all* of the updates will be handled by a single node,
  right? -- If your time resolution is a minute, yes, it will be a problem.
  And depending on the insert rate, it can quickly become a bottleneck
  during this minute.

   The manual index approach suffers a lot from this bottleneck issue under
  heavy workloads, which is the main reason they implemented a distributed
  secondary index. There is no free lunch though. What you gain in terms of
  control and tuning with the manual index, you lose on the load distribution
  side.




 On Mon, Aug 11, 2014 at 11:17 PM, Ian Rose ianr...@fullstory.com wrote:

 Hi DuyHai,

 Thanks for the detailed response!  A few responses below:

 On a side node, your usage of secondary index is not the best one.
 Indeed, indexing the update date will lead to a situation where for one
 date, you'll mostly have one or a few matching items (assuming that the
 update date resolution is small enough and update rate is not intense).
  -- I should have mentioned this originally (slipped my mind) but to deal
 specifically with this problem I had planned to use a timestamp with a
 resolution of 1 minute (like your minute_bucket).  So that way each index
 entry *will* have quite a few entries and the index as a whole won't
 grow too big.  Is my thinking correct here?

  You'd be better off creating a manual reverse-index to track modification
 date, something like this  -- I had considered an approach like this but
 my concern is that for any given minute *all* of the updates will be
 handled by a single node, right?  For example, if the minute_bucket is 2739
 then for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.

 Cheers,
 Ian




 On Sun, Aug 10, 2014 at 5:20 PM, DuyHai Doan doanduy...@gmail.com
 wrote:

 Hello Ian

 It sounds like this 100k limit is, indeed, a global limit as opposed
 to a per-row limit --The threshold applies to each REQUEST, not
 partition or globally.

 The threshold does not apply to a partition (physical row) simply
 because in one request you can fetch data from many partitions (multi get
 slice). There was a JIRA about this here:
 https://issues.apache.org/jira/browse/CASSANDRA-6865

 Are these tombstones ever GCed out of the index? -- Yes they are,
 during compactions of the index column family.

  How frequently? -- That's the real pain. Indeed you do not have any
  control over the tuning of secondary index CF compaction. As far as I know,
  the compaction settings (strategy, min/max thresholds...) inherit from those
  of the base table.

  Now, by looking very quickly at your data model, it seems that you have a
  skinny partition pattern. Since you mentioned that the date is updated only
  10 times max, you should not run into the tombstone threshold issue.

  On a side note, your usage of a secondary index is not the best one.
  Indeed, indexing the update date will lead to a situation where for one
  date, you'll mostly have one or a few matching items (assuming that the
  update date resolution is small enough and the update rate is not intense). It
  is the high-cardinality scenario to be avoided (
  http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html).
  Plus, the query on the index (find all items where last_updated > [now - 30
  minutes]) makes things worse since it is not an exact match but an inequality.

   You'd be better off creating a manual reverse-index to track the modification
  date, something like this:

 CREATE TABLE last_updated_item (
 minute_bucket int, // format MMDDHHmm
 last_update_date timestamp,
 item_id ascii,
 PRIMARY KEY(minute_bucket, last_update_date)
 );

   The last_update_date column is quite self-explanatory. The
  minute_bucket is trickier. The idea is to split time into 30-minute
  buckets. 00:00 to 00:30 is bucket 1, 00:30 to 01:00 is bucket 2 and so on.
  For a whole day, you'd have 48 buckets. We need to put data into buckets to
  avoid ultra-wide rows since you mentioned that there are 10 items (so 10
  updates) / sec. Of course, 30 mins is just an example; you can tune it down
  to a window of 5 minutes or 1 minute, depending on the insertion rate.





 On Sun, Aug 10, 2014 at 10:02 PM, Ian Rose ianr...@fullstory.com
 wrote:

 Hi Mark -

 Thanks for the clarification but as I'm not too 

Re: Cassandra process exiting mysteriously

2014-08-12 Thread Clint Kelly
Hi Or,

For now I removed the test that was failing like this from our suite
and made a note to revisit it in a couple of weeks.  Unfortunately I
still don't know what the issue is.  I'll post here if I figure it out
(please do the same!).  My working hypothesis now is that we had some
kind of OOM problem.
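
If it does turn out to be the kernel OOM killer rather than the JVM, it should
leave a trail on the host. This is the quick check I plan to run on the build
machine (a sketch, assuming a Linux box with standard kernel logging):

# the kernel OOM killer logs to the kernel ring buffer / syslog; a JVM OOM does not
dmesg | grep -iE 'killed process|out of memory'
grep -i oom /var/log/messages    # or /var/log/syslog on Debian/Ubuntu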

Best regards,
Clint

On Tue, Aug 12, 2014 at 12:23 AM, Or Sher or.sh...@gmail.com wrote:
 Clint, did you find anything?
 I just noticed it happens to us too, on only one node in our CI cluster.
 I don't think there is any special usage before it happens... The last line
 in the log before the shutdown lines is at least an hour earlier.
 We're using C* 2.0.9.


 On Thu, Aug 7, 2014 at 12:49 AM, Clint Kelly clint.ke...@gmail.com wrote:

 Hi Rob,

 Thanks for the clarification; this is really useful.  I'll run some
 experiments to see if the problem is a JVM OOM on our build machine.

 Best regards,
 Clint

 On Wed, Aug 6, 2014 at 1:14 PM, Robert Coli rc...@eventbrite.com wrote:
  On Wed, Aug 6, 2014 at 1:12 PM, Robert Coli rc...@eventbrite.com
  wrote:
 
  On Wed, Aug 6, 2014 at 1:11 AM, Duncan Sands duncan.sa...@gmail.com
  wrote:
 
  this doesn't look like an OOM to me.  If the kernel OOM kills
  Cassandra
  then Cassandra instantly vaporizes, and there will be nothing in the
  Cassandra logs (you will find information about the OOM in the system
  logs
  though, eg in dmesg).  In the log snippet above you see an orderly
  shutdown,
  this is completely different to the instant OOM kill.
 
 
  Not really.
 
  https://issues.apache.org/jira/browse/CASSANDRA-7507
 
 
   To be clear, there are two different OOMs here; I am talking about the JVM
   OOM, not the system level. As CASSANDRA-7507 indicates, JVM OOM does not
   necessarily result in the cassandra process dying, and can in fact trigger
   clean shutdown.
 
  System level OOM will in fact send the equivalent of KILL, which will
  not
  trigger the clean shutdown hook in Cassandra.
 
  =Rob




 --
 Or Sher


OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread jivko donev
Hi all, 

We have a node with a commit log directory of ~4G. During start-up of the node,
while the commit log is replaying, the used heap space grows constantly, ending
with an OOM error.

The heap size and new heap size properties are - 1G and 256M. We are using the 
default settings for commitlog_sync, commitlog_sync_period_in_ms and 
commitlog_segment_size_in_mb.
 
The log shows that cassandra is stuck on MutationStage:
Active   Pending   Completed   Blocked
    16       385         196         0


The stack trace is:
ERROR [metrics-meter-tick-thread-1] 2014-08-12 19:15:10,181 
CassandraDaemon.java (line 198) Exception in thread 
Thread[metrics-meter-tick-thread-1,5,main]
java.lang.OutOfMemoryError: Java heap space
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.addWaiter(Unknown Source)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(Unknown Source)
        at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(Unknown 
Source)
        at java.util.concurrent.locks.ReentrantLock.lock(Unknown Source)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.offer(Unknown 
Source)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.add(Unknown 
Source)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.add(Unknown 
Source)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor.reExecutePeriodic(Unknown 
Source)
        at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown
 Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
ERROR [MutationStage:8] 2014-08-12 19:15:10,181 CassandraDaemon.java (line 198) 
Exception in thread Thread[MutationStage:8,5,main]
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.duplicate(Unknown Source)
        at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:62)
        at 
org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:72)
        at 
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:99)
        at 
org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:35)
        at 
org.apache.cassandra.db.RangeTombstoneList.addAll(RangeTombstoneList.java:188)
        at org.apache.cassandra.db.DeletionInfo.add(DeletionInfo.java:219)
        at 
org.apache.cassandra.db.AtomicSortedColumns.addAllWithSizeDelta(AtomicSortedColumns.java:184)
        at org.apache.cassandra.db.Memtable.resolve(Memtable.java:226)
        at org.apache.cassandra.db.Memtable.put(Memtable.java:173)
        at 
org.apache.cassandra.db.ColumnFamilyStore.apply(ColumnFamilyStore.java:893)
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:368)
        at org.apache.cassandra.db.Keyspace.apply(Keyspace.java:333)
        at 
org.apache.cassandra.db.commitlog.CommitLogReplayer$1.runMayThrow(CommitLogReplayer.java:352)
        at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.util.concurrent.FutureTask.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)
ERROR [MutationStage:8] 2014-08-12 19:15:12,080 CassandraDaemon.java (line 198) 
Exception in thread Thread[MutationStage:8,5,main]
java.lang.IllegalThreadStateException
        at java.lang.Thread.start(Unknown Source)
        at 
org.apache.cassandra.service.CassandraDaemon$2.uncaughtException(CassandraDaemon.java:204)
        at 
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.handleOrLog(DebuggableThreadPoolExecutor.java:220)
        at 
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.logExceptionsAfterExecute(DebuggableThreadPoolExecutor.java:203)
        at 
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:183)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.lang.Thread.run(Unknown Source)


Increasing the heap space to 2G solves the problem, but we want to know if the
problem could be solved without increasing the heap space. Does anyone have
experience with a similar problem? If so, are there any tuning options in
cassandra.yaml?


Any help will be much appreciated. If you need more information, feel free to
ask.

Thanks,
Jivko Donev

Number of columns per row for composite columns?

2014-08-12 Thread hlqv
Hi everyone,
I'm confused about the number of columns in a row of Cassandra; as far as I know
the limit is 2 billion columns per row. Given that, if I have a composite column
name in each row, e.g. (timestamp, userid), is the number of columns per row the
number of distinct 'timestamp' values, or is each distinct '(timestamp, userid)'
pair a column?


Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread Robert Coli
On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

 We have a node with a commit log directory of ~4G. During start-up of the node,
  while the commit log is replaying, the used heap space grows constantly,
  ending with an OOM error.

 The heap size and new heap size properties are - 1G and 256M. We are using
 the default settings for commitlog_sync, commitlog_sync_period_in_ms
 and commitlog_segment_size_in_mb.


What version of Cassandra?

1G is tiny for cassandra heap. There is a direct relationship between the
data in the commitlog and memtables and in the heap. You almost certainly
need more heap or less commitlog.
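
For illustration only (the numbers are examples, not tuning advice): the heap is
set in conf/cassandra-env.sh, and cassandra.yaml's commitlog_total_space_in_mb
caps how much commitlog can accumulate before the oldest segments are flushed
and recycled.

# conf/cassandra-env.sh: override the auto-calculated sizes
MAX_HEAP_SIZE="2G"
HEAP_NEWSIZE="400M"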

=Rob


Re: Replacing a dead node in Cassandra 2.0.8

2014-08-12 Thread Robert Coli
On Tue, Aug 12, 2014 at 4:33 AM, tsi thorsten.s...@t-systems.com wrote:

 In the datastax documentation there is a description how to replace a dead
 node
 (
 http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html
 ).
 Is the  replace_address option required even if the IP address of the new
 node is the same as the original one (I read a note about the auto
 bootstrapping being stored somewhere in the system tables)?


In order for the node to bootstrap into ranges the rest of the cluster
thinks it already owns, you will need to provide the ip in replace_address.
This allows it to start up in a special way that is effectively bootstrap
to the same tokens it previously had.
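
For reference, a minimal sketch of how that is typically passed on the
replacement node (the address is a placeholder, and the line should be removed
again once the node has finished joining):

# conf/cassandra-env.sh on the replacement node
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<address_of_dead_node>"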

=Rob


Re: clarification on 100k tombstone limit in indexes

2014-08-12 Thread Tyler Hobbs
On Mon, Aug 11, 2014 at 4:17 PM, Ian Rose ianr...@fullstory.com wrote:


  You'd be better off creating a manual reverse-index to track modification date,
 something like this  -- I had considered an approach like this but my
 concern is that for any given minute *all* of the updates will be handled
 by a single node, right?  For example, if the minute_bucket is 2739 then
 for that one minute, every single item update will flow to the node at
 HASH(2739).  Assuming I am thinking about that right, that seemed like a
 potential scaling bottleneck, which scared me off that approach.


If you're concerned about bottlenecking on one node (or set of replicas)
during the minute, add an additional integer column to the partition key
(making it a composite partition key if it isn't already).  When inserting,
randomly pick a value between, say, 0 and 10 to use for this column.  When
reading, read all 10 partitions and merge them.  (Alternatively, instead of
using a random number, you could hash the other key components and use the
lowest bits for the value.  This has the advantage of being deterministic.)
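
For what it's worth, a rough sketch of how that could look for the reverse index
discussed up-thread. The table and column names are made up, and I use
sub_bucket values 0-9 so that a read covers exactly 10 sub-partitions:

cqlsh <<'EOF'
CREATE TABLE last_updated_item_v2 (
    minute_bucket int,          -- format MMDDHHmm, as before
    sub_bucket int,             -- random 0-9: spreads one minute's writes over 10 partitions
    last_update_date timestamp,
    item_id ascii,
    -- item_id is also in the clustering key so that two items updated in the
    -- same millisecond don't overwrite each other
    PRIMARY KEY ((minute_bucket, sub_bucket), last_update_date, item_id)
);
EOF

# on read, fetch all 10 sub-partitions for the minute and merge client-side
for b in $(seq 0 9); do
    echo "SELECT item_id, last_update_date FROM last_updated_item_v2
          WHERE minute_bucket = 8121530 AND sub_bucket = $b;" | cqlsh
done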


-- 
Tyler Hobbs
DataStax http://datastax.com/


Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread jivko donev
Hi Robert,

Thanks for your reply. The Cassandra version is 2.0.7. Is there some commonly
used rule for determining the commitlog and memtable sizes depending on the
heap size? What would be the main disadvantage of having a smaller commitlog?


On Tuesday, August 12, 2014 8:32 PM, Robert Coli rc...@eventbrite.com wrote:
 




On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:

We have a node with a commit log directory of ~4G. During start-up of the node,
while the commit log is replaying, the used heap space grows constantly, ending
with an OOM error.



The heap size and new heap size properties are - 1G and 256M. We are using the 
default settings for commitlog_sync, commitlog_sync_period_in_ms and 
commitlog_segment_size_in_mb.

What version of Cassandra?

1G is tiny for cassandra heap. There is a direct relationship between the data 
in the commitlog and memtables and in the heap. You almost certainly need more 
heap or less commitlog.

=Rob

Re: Number of columns per row for composite columns?

2014-08-12 Thread Jack Krupansky
Your question is a little too tangled for me... Are you asking about rows in a 
partition (some people call that a “storage row”) or columns per row? The 
latter is simply the number of columns that you have declared in your table.

The total number of columns – or more properly, “cells” – in a partition would 
be the number of rows you have inserted in that partition times the number of 
columns you have declared in the table.

If you need to review the terminology:
http://www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
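
As a concrete sketch in CQL 3 terms (the table and values below are
hypothetical): a composite column name such as (timestamp, userid) corresponds
to clustering columns, and each distinct combination of clustering values gets
its own storage columns (cells), which is what the roughly-2-billion-per-partition
figure counts.

cqlsh <<'EOF'
CREATE TABLE events (
    sensor_id ascii,
    ts timestamp,
    userid ascii,
    val int,
    PRIMARY KEY (sensor_id, ts, userid)  -- (ts, userid) becomes the composite column name on disk
);
-- Same ts, different userid: two distinct storage columns (cells) in the
-- 'sensor_id = s1' partition, so the limit counts distinct (ts, userid)
-- combinations, not distinct ts values.
INSERT INTO events (sensor_id, ts, userid, val) VALUES ('s1', '2014-08-12 10:00:00', 'u1', 1);
INSERT INTO events (sensor_id, ts, userid, val) VALUES ('s1', '2014-08-12 10:00:00', 'u2', 2);
EOF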

-- Jack Krupansky

From: hlqv 
Sent: Tuesday, August 12, 2014 1:13 PM
To: user@cassandra.apache.org 
Subject: Number of columns per row for composite columns?

Hi everyone,
I'm confused about the number of columns in a row of Cassandra; as far as I know
the limit is 2 billion columns per row. Given that, if I have a composite column
name in each row, e.g. (timestamp, userid), is the number of columns per row the
number of distinct 'timestamp' values, or is each distinct '(timestamp, userid)'
pair a column?


Nodetool Repair questions

2014-08-12 Thread Viswanathan Ramachandran
Some questions on nodetool repair.

1. This tool repairs inconsistencies across replicas of the row. Since
latest update always wins, I don't see inconsistencies other than ones
resulting from the combination of deletes, tombstones, and crashed nodes.
Technically, if data is never deleted from cassandra, then nodetool repair
does not need to be run at all. Is this understanding correct? If wrong,
can anyone provide other ways inconsistencies could occur?

2. Want to understand the performance of 'nodetool repair' in a Cassandra
multi data center setup. As we add nodes to the cluster in various data
centers, does the performance of nodetool repair on each node increase
linearly, or is it quadratic ? The essence of this question is: If I have a
keyspace with x number of replicas in each data center, do I have to deal
with an upper limit on the number of data centers/nodes?


Thanks

Vish


Re: Nodetool Repair questions

2014-08-12 Thread Mark Reddy
Hi Vish,

1. This tool repairs inconsistencies across replicas of the row. Since
 latest update always wins, I dont see inconsistencies other than ones
 resulting from the combination of deletes, tombstones, and crashed nodes.
 Technically, if data is never deleted from cassandra, then nodetool repair
 does not need to be run at all. Is this understanding correct? If wrong,
 can anyone provide other ways inconsistencies could occur?


Even if you never delete data you should run repairs occasionally to ensure
overall consistency. While hinted handoffs and read repairs do lead to
better consistency, they are only helpers/optimization and are not regarded
as operations that ensure consistency.
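
As a rough sketch (the keyspace name, host list and cadence are illustrative,
not a recommendation): a rolling, primary-range repair across the nodes,
scheduled so that every node completes within gc_grace_seconds:

# run from cron/automation, one node at a time
for host in node1.example.com node2.example.com node3.example.com; do
    nodetool -h "$host" repair -pr my_keyspace
done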

2. Want to understand the performance of 'nodetool repair' in a Cassandra
 multi data center setup. As we add nodes to the cluster in various data
 centers, does the performance of nodetool repair on each node increase
 linearly, or is it quadratic ?


It's difficult to calculate the performance of a repair; I've seen the time
to completion fluctuate between 4hrs and 10hrs+ on the same node. However in
theory adding more nodes would spread the data and free up machine
resources, thus resulting in more performant repairs.

The essence of this question is: If I have a keyspace with x number of
 replicas in each data center, do I have to deal with an upper limit on the
 number of data centers/nodes?


Could you expand on why you believe there would be an upper limit of
dc/nodes due to running repairs?


Mark


On Tue, Aug 12, 2014 at 10:06 PM, Viswanathan Ramachandran 
vish.ramachand...@gmail.com wrote:

 Some questions on nodetool repair.

 1. This tool repairs inconsistencies across replicas of the row. Since
 latest update always wins, I dont see inconsistencies other than ones
 resulting from the combination of deletes, tombstones, and crashed nodes.
 Technically, if data is never deleted from cassandra, then nodetool repair
 does not need to be run at all. Is this understanding correct? If wrong,
 can anyone provide other ways inconsistencies could occur?

 2. Want to understand the performance of 'nodetool repair' in a Cassandra
 multi data center setup. As we add nodes to the cluster in various data
 centers, does the performance of nodetool repair on each node increase
 linearly, or is it quadratic ? The essence of this question is: If I have a
 keyspace with x number of replicas in each data center, do I have to deal
 with an upper limit on the number of data centers/nodes?


 Thanks

 Vish



Re: OOM(Java heap space) on start-up during commit log replaying

2014-08-12 Thread graham sanderson
Agreed, we need more details; and just start by increasing the heap, because
that may well solve the problem.

I have just observed (which makes sense when you think about it) while testing 
fix for https://issues.apache.org/jira/browse/CASSANDRA-7546, that if you are 
replaying a commit log which has a high level of updates for the same partition 
key, you can hit that issue - excess memory allocation under high contention 
for the same partition key - (this might not cause OOM but will certainly 
massively tax GC and it sounds like you don’t have a lot/any headroom).

On Aug 12, 2014, at 12:31 PM, Robert Coli rc...@eventbrite.com wrote:

 
 On Tue, Aug 12, 2014 at 9:34 AM, jivko donev jivko_...@yahoo.com wrote:
 We have a node with commit log director ~4G. During start-up of the node on 
 commit log replaying the used heap space is constantly growing ending with 
 OOM error. 
 
 The heap size and new heap size properties are - 1G and 256M. We are using 
 the default settings for commitlog_sync, commitlog_sync_period_in_ms and 
 commitlog_segment_size_in_mb.
 
 What version of Cassandra?
 
 1G is tiny for cassandra heap. There is a direct relationship between the 
 data in the commitlog and memtables and in the heap. You almost certainly 
 need more heap or less commitlog.
 
 =Rob
   





Re: Nodetool Repair questions

2014-08-12 Thread Andrey Ilinykh
1. You don't have to repair if you use QUORUM consistency and you don't
delete data.
2. Performance depends on the size of data each node has. It's very difficult to
predict. It may take days.

Thank you,
  Andrey


On Tue, Aug 12, 2014 at 2:06 PM, Viswanathan Ramachandran 
vish.ramachand...@gmail.com wrote:

 Some questions on nodetool repair.

 1. This tool repairs inconsistencies across replicas of the row. Since
 latest update always wins, I dont see inconsistencies other than ones
 resulting from the combination of deletes, tombstones, and crashed nodes.
 Technically, if data is never deleted from cassandra, then nodetool repair
 does not need to be run at all. Is this understanding correct? If wrong,
 can anyone provide other ways inconsistencies could occur?

 2. Want to understand the performance of 'nodetool repair' in a Cassandra
 multi data center setup. As we add nodes to the cluster in various data
 centers, does the performance of nodetool repair on each node increase
 linearly, or is it quadratic ? The essence of this question is: If I have a
 keyspace with x number of replicas in each data center, do I have to deal
 with an upper limit on the number of data centers/nodes?


 Thanks

 Vish



Re: Nodetool Repair questions

2014-08-12 Thread Viswanathan Ramachandran
Thanks Mark,
Since we have replicas in each data center, the addition of a new data center
(and new replicas) has a performance implication for nodetool repair.
I do understand that adding nodes without increasing the number of replicas may
improve repair performance, but in this case we are adding a new data center
and additional replicas, which is added overhead for nodetool repair.
Hence the thinking that we may reach an upper limit, which would be the
point where the nodetool repair costs are way too high.


On Tue, Aug 12, 2014 at 2:59 PM, Mark Reddy mark.re...@boxever.com wrote:

 Hi Vish,

 1. This tool repairs inconsistencies across replicas of the row. Since
 latest update always wins, I dont see inconsistencies other than ones
 resulting from the combination of deletes, tombstones, and crashed nodes.
 Technically, if data is never deleted from cassandra, then nodetool repair
 does not need to be run at all. Is this understanding correct? If wrong,
 can anyone provide other ways inconsistencies could occur?


 Even if you never delete data you should run repairs occasionally to
 ensure overall consistency. While hinted handoffs and read repairs do lead
 to better consistency, they are only helpers/optimization and are not
 regarded as operations that ensure consistency.

 2. Want to understand the performance of 'nodetool repair' in a Cassandra
 multi data center setup. As we add nodes to the cluster in various data
 centers, does the performance of nodetool repair on each node increase
 linearly, or is it quadratic ?


 Its difficult to calculate the performance of a repair, I've seen the time
 to completion fluctuate between 4hrs to 10hrs+ on the same node. However in
 theory adding more nodes would spread the data and free up machine
 resources, thus resulting in more performant repairs.

 The essence of this question is: If I have a keyspace with x number of
 replicas in each data center, do I have to deal with an upper limit on the
 number of data centers/nodes?


 Could you expand on why you believe there would be an upper limit of
 dc/nodes due to running repairs?


 Mark


 On Tue, Aug 12, 2014 at 10:06 PM, Viswanathan Ramachandran 
 vish.ramachand...@gmail.com wrote:

  Some questions on nodetool repair.

 1. This tool repairs inconsistencies across replicas of the row. Since
 latest update always wins, I dont see inconsistencies other than ones
 resulting from the combination of deletes, tombstones, and crashed nodes.
 Technically, if data is never deleted from cassandra, then nodetool repair
 does not need to be run at all. Is this understanding correct? If wrong,
 can anyone provide other ways inconsistencies could occur?

 2. Want to understand the performance of 'nodetool repair' in a Cassandra
 multi data center setup. As we add nodes to the cluster in various data
 centers, does the performance of nodetool repair on each node increase
 linearly, or is it quadratic ? The essence of this question is: If I have a
 keyspace with x number of replicas in each data center, do I have to deal
 with an upper limit on the number of data centers/nodes?


 Thanks

 Vish





Re: Nodetool Repair questions

2014-08-12 Thread Viswanathan Ramachandran
Andrey, QUORUM consistency and no deletes makes perfect sense.
I believe we could modify that to EACH_QUORUM or QUORUM consistency and no
deletes - isn't that right?

Thanks


On Tue, Aug 12, 2014 at 3:10 PM, Andrey Ilinykh ailin...@gmail.com wrote:

 1. You don't have to repair if you use QUORUM consistency and you don't
 delete data.
 2.Performance depends on size of data each node has. It's very difficult
 to predict. It may take days.

 Thank you,
   Andrey



 On Tue, Aug 12, 2014 at 2:06 PM, Viswanathan Ramachandran 
 vish.ramachand...@gmail.com wrote:

 Some questions on nodetool repair.

 1. This tool repairs inconsistencies across replicas of the row. Since
 latest update always wins, I dont see inconsistencies other than ones
 resulting from the combination of deletes, tombstones, and crashed nodes.
 Technically, if data is never deleted from cassandra, then nodetool repair
 does not need to be run at all. Is this understanding correct? If wrong,
 can anyone provide other ways inconsistencies could occur?

 2. Want to understand the performance of 'nodetool repair' in a Cassandra
 multi data center setup. As we add nodes to the cluster in various data
 centers, does the performance of nodetool repair on each node increase
 linearly, or is it quadratic ? The essence of this question is: If I have a
 keyspace with x number of replicas in each data center, do I have to deal
 with an upper limit on the number of data centers/nodes?


 Thanks

 Vish





range query times out (on 1 node, just 1 row in table)

2014-08-12 Thread Ian Rose
Hi -

I am currently running a single Cassandra node on my local dev machine.
 Here is my (test) schema (which is meaningless, I created it just to
demonstrate the issue I am running into):

CREATE TABLE foo (
  foo_name ascii,
  foo_shard bigint,
  int_val bigint,
  PRIMARY KEY ((foo_name, foo_shard))
) WITH read_repair_chance=0.1;

CREATE INDEX ON foo (int_val);
CREATE INDEX ON foo (foo_name);

I have inserted just a single row into this table:
insert into foo(foo_name, foo_shard, int_val) values('dave', 27, 100);

This query works fine:
select * from foo where foo_name='dave';

But when I run this query, I get an RPC timeout:
select * from foo where foo_name='dave' and int_val > 0 allow filtering;

With tracing enabled, here is the trace output:
http://pastebin.com/raw.php?i=6XMEVUcQ

(In short, everything looks fine to my untrained eye until 10s elapsed, at
which time the following event is logged: Timed out; received 0 of 1
responses for range 257 of 257)

Can anyone help interpret this error?

Many thanks!
Ian


Re: Nodetool Repair questions

2014-08-12 Thread Andrey Ilinykh
On Tue, Aug 12, 2014 at 4:46 PM, Viswanathan Ramachandran 
vish.ramachand...@gmail.com wrote:

 Andrey, QUORUM consistency and no deletes makes perfect sense.
 I believe we could modify that to EACH_QUORUM or QUORUM consistency and no
 deletes - isnt that right?


 yes.