unable to find sufficient sources for streaming range

2014-07-02 Thread Daning Wang
We are running Cassandra 1.2.5

We have an 8-node cluster. We removed one machine from the cluster and tried
to add it back (we are using vnodes, and some nodes own more tokens than
others, so by rejoining this machine we hoped it would take some load off the
busy machines). But we got the following exception, and the node can no
longer join the ring.
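
Not part of the original report, but a quick hedged check before retrying the
bootstrap (1.2-era nodetool): this error usually means every candidate source
for that range is down or gone, which the ring view should show.

nodetool -h localhost status   # any DN node still owning ranges?
nodetool -h localhost ring     # per-token endpoints for the failing range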

Please help,

Thanks in advance,


 INFO 16:01:56,260 JOINING: Starting to bootstrap...
ERROR 16:01:56,514 Exception encountered during startup
java.lang.IllegalStateException: unable to find sufficient sources for streaming range (131921530760098415548184818173535242096,132123583169200197961735373586277861750]
        at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:205)
        at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129)
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
java.lang.IllegalStateException: unable to find sufficient sources for streaming range (131921530760098415548184818173535242096,132123583169200197961735373586277861750]
        at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:205)
        at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:129)
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:81)
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:924)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:693)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:548)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:445)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:325)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:413)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:456)
Exception encountered during startup: unable to find sufficient sources for streaming range (131921530760098415548184818173535242096,132123583169200197961735373586277861750]
ERROR 16:01:56,518 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
        at org.apache.cassandra.service.StorageService.stopRPCServer(StorageService.java:321)
        at org.apache.cassandra.service.StorageService.shutdownClientServers(StorageService.java:362)
        at org.apache.cassandra.service.StorageService.access$000(StorageService.java:88)
        at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:513)


Daning


Bulk writes and key cache

2014-02-03 Thread Daning Wang
Does Cassandra put keys in key cache during the write path?

Suppose I have two tables. The key cache for the first table is warmed up
nicely, and I want to insert millions of rows into the second table, which
has not yet been read. Will that affect the cache hit ratio for the first
table?
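
Not from the original post, but an easy hedged way to observe this (1.2-era
nodetool): sample the key cache line before and during the bulk insert and
watch whether the hit rate moves.

nodetool -h localhost info | grep "Key Cache"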

Thanks,

Daning


Move token to another node on 1.2.x

2013-11-07 Thread Daning Wang
How do I move a token to another node on 1.2.x? I have tried the move command:

[cassy@dsat103.e1a ~]$ nodetool move 168755834953206242653616795390304335559
Exception in thread "main" java.io.IOException: target token 168755834953206242653616795390304335559 is already owned by another node.
        at org.apache.cassandra.service.StorageService.move(StorageService.java:2908)
        at org.apache.cassandra.service.StorageService.move(StorageService.java:2892)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)


Changing the token number a bit:

[cassy@dsat103.e1a ~]$ nodetool -h localhost move
168755834953206242653616795390304335560
This node has more than one token and cannot be moved thusly

We don't want to use cassandra-shuffle because it puts too much load on the
servers; we just want to move a few tokens.
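
For what it's worth (not from the thread): nodetool move only works on
single-token nodes, so a hedged vnode-era alternative is to rebalance one
node at a time rather than move individual tokens:

nodetool -h <node> decommission   # hand its ranges back to the ring
# then wipe the node's data directories and restart it, so it bootstraps
# a fresh set of vnode tokens sized by the current load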

Thanks,

Daning


Re: ReadCount change rate is different across nodes

2013-10-30 Thread Daning Wang
Thanks. Actually, I forgot to mention that this is a multi-datacenter
environment and we have the dynamic snitch disabled, because we saw a
performance impact from it in the multi-datacenter environment.





On Wed, Oct 30, 2013 at 11:12 AM, Piavlo lolitus...@gmail.com wrote:

 On 10/30/2013 02:06 AM, Daning Wang wrote:

 We are running 1.2.5 on 8 nodes (256 tokens). All the nodes run on the same
 type of machine, and the DB size is about the same on each. But we recently
 checked the ReadCount stats through JMX and found that some nodes show about
 3 times the change rate (we calculated the changes per minute) of others.

 We are using Hector on the client side, and clients connect to all the
 servers; we checked the open connections on each server, and the numbers are
 about the same.

 What could cause this problem, and how do we debug it?

 Check the per-node read latency CF metrics. And I guess you have the dynamic
 snitch enabled?
 http://www.datastax.com/dev/blog/dynamic-snitching-in-cassandra-past-present-and-future



 Thanks in advance,

 Daning





ReadCount change rate is different across nodes

2013-10-29 Thread Daning Wang
We are running 1.2.5 on 8 nodes (256 tokens). All the nodes run on the same
type of machine, and the DB size is about the same on each. But we recently
checked the ReadCount stats through JMX and found that some nodes show about
3 times the change rate (we calculated the changes per minute) of others.

We are using Hector on the client side, and clients connect to all the
servers; we checked the open connections on each server, and the numbers are
about the same.

What could cause this problem, and how do we debug it?


Thanks in advance,

Daning


Key cache size

2013-09-04 Thread Daning Wang
We noticed that the key cache is not being fully populated. We have set the
key cache size to 1024 MB:

key_cache_size_in_mb: 1024

But no node shows a cache capacity of 1 GB. We recently upgraded to 1.2.5;
could this be an issue in that version?

Token: (invoke with -T/--tokens to see all 256 tokens)
ID   : 0fd912fb-3187-462b-8c8a-7d223751b649
Gossip active: true
Thrift active: true
Load : 73.16 GB
Generation No: 1372374984
Uptime (seconds) : 5953779
Heap Memory (MB) : 5440.59 / 10035.25
Data Center  : dc1
Rack : rac1
Exceptions   : 34601
Key Cache: size 540060752 (bytes), capacity 540060796 (bytes),
12860975403 hits, 15535054378 requests, 0.839 recent hit rate, 14400 save
period in seconds
Row Cache: size 0 (bytes), capacity 0 (bytes), 0 hits, 0 requests,
NaN recent hit rate, 0 save period in seconds
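
For reference (arithmetic added here, not in the original post): the reported
capacity of 540060796 bytes works out to 540060796 / 1024 / 1024, i.e. about
515 MB - roughly half of the configured 1024 MB.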

Thanks,

Daning


Dynamic Snitch and EC2MultiRegionSnitch

2013-07-01 Thread Daning Wang
How does the dynamic snitch work with EC2MultiRegionSnitch? Can dynamic
routing be confined to one datacenter? We don't want requests routed to the
other datacenter even when its nodes are idle, since the network in between
could be slow.
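
For context, the 1.2-era dynamic snitch knobs live in cassandra.yaml (a
sketch of the stock defaults; check your own config before relying on it):

dynamic_snitch_update_interval_in_ms: 100
dynamic_snitch_reset_interval_in_ms: 600000
dynamic_snitch_badness_threshold: 0.1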

Thanks in advance,

Daning


Re: Multiple data center performance

2013-06-12 Thread Daning Wang
Sorry for the confusion.

Sylvain - what do you think could cause the higher client latency in multi-DC
(CL=ONE for both reads and writes)? Clients only connect to nodes in the same
DC. We did see performance improve greatly after changing the replication
factor for counters, but it is still slower than when the other DC is shut
down.


Thanks,

Daning



On Wed, Jun 12, 2013 at 7:48 AM, Sylvain Lebresne sylv...@datastax.com wrote:


 Is there something special of this kind regarding counters over multiDC ?


 No. Counters behave exactly as other writes as far as the consistency level
 is concerned.
 Technically, the counter write path is different from the normal write path
 in the sense that a counter write will be written to one replica first and
 then written to the rest of the replicas afterwards (with a local read on
 the first replica in between, which is why counter writes are slower than
 normal ones). But, outside of the obvious performance impact, this has no
 impact on the behavior observed from a client point of view. The consistency
 level has the exact same meaning in particular (though one small difference
 is that counters don't support CL.ANY).

 --
 Sylvain
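
To make the write path described above concrete, a minimal hedged sketch
(hypothetical keyspace/table; 1.2-era CQL3, where a counter must live in a
table whose non-key columns are all counters):

cqlsh> CREATE TABLE dsat.page_hits (page text PRIMARY KEY, hits counter);
cqlsh> UPDATE dsat.page_hits SET hits = hits + 1 WHERE page = '/index';

Each such increment is applied on one replica first (with a local read there)
and then propagated to the remaining replicas, whatever the consistency level.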



 Thank you anyway Sylvain


 2013/6/12 Sylvain Lebresne sylv...@datastax.com

 It is the normal behavior, but that's true of any update, not only of
 counters.

 The consistency level does *not* influence which replica are written to.
 Cassandra always write to all replicas. The consistency level only decides
 how replica acknowledgement are waited for.

 --
 Sylvain


 On Wed, Jun 12, 2013 at 4:56 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 counter will replicate to all replicas during write regardless of the
 consistency level

 Is that the normal behavior or a bug?


 2013/6/11 Daning Wang dan...@netseer.com

 It is the counters that caused the problem: counter writes replicate to all
 replicas regardless of the consistency level.

 In our case we don't need to sync the counters across datacenters, so moving
 the counters to a new keyspace with all replicas in one datacenter solved
 the problem.

 There is a replicate_on_write option on the table. Turning it off for
 counters might give better performance, but at a high risk of losing data
 and creating inconsistency. I did not try this option.

 Daning


 On Sat, Jun 8, 2013 at 6:53 AM, srmore comom...@gmail.com wrote:

 I am seeing similar behavior. In my case I have 2 nodes in each datacenter,
 and one node always has high latency (equal to the latency between the two
 datacenters). When one of the datacenters is shut down, the latency drops.

 I am curious to know whether anyone else has had these issues and, if so,
 how you got around it.

 Thanks !


  On Fri, Jun 7, 2013 at 11:49 PM, Daning Wang dan...@netseer.com wrote:

 We have deployed multi-datacenter but have a performance issue. When the
 nodes in the other datacenter are up, the read response time seen by clients
 is 4 or 5 times higher; when we take those nodes down, the response time
 becomes normal (comparable to before we changed to multi-datacenter).

 We have high volume on the cluster, and the consistency level is ONE for
 reads, so my understanding is that most of the traffic between the
 datacenters should be read repair; but it seems that should not create much
 delay.

 What could cause the problem? How do we debug this?

 Here is the keyspace:

 [default@dsat] describe dsat;
 Keyspace: dsat:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [dc2:1, dc1:3]
   Column Families:
 ColumnFamily: categorization_cache


 Ring

 Datacenter: dc1
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns (effective)  Host ID
 Rack
 UN  xx.xx.xx..111   59.2 GB256 37.5%
 4d6ed8d6-870d-4963-8844-08268607757e  rac1
 DN  xx.xx.xx..121   99.63 GB   256 37.5%
 9d0d56ce-baf6-4440-a233-ad6f1d564602  rac1
 UN  xx.xx.xx..120   66.32 GB   256 37.5%
 0fd912fb-3187-462b-8c8a-7d223751b649  rac1
 UN  xx.xx.xx..118   63.61 GB   256 37.5%
 3c6e6862-ab14-4a8c-9593-49631645349d  rac1
 UN  xx.xx.xx..117   68.16 GB   256 37.5%
 ee6cdf23-d5e4-4998-a2db-f6c0ce41035a  rac1
 UN  xx.xx.xx..116   32.41 GB   256 37.5%
 f783eeef-1c51-4f91-ab7c-a60669816770  rac1
 UN  xx.xx.xx..115   64.24 GB   256 37.5%
 e75105fb-b330-4f40-aa4f-8e6e11838e37  rac1
 UN  xx.xx.xx..112   61.32 GB   256 37.5%
 2547ee54-88dd-4994-a1ad-d9ba367ed11f  rac1
 Datacenter: dc2
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns (effective)  Host ID
 Rack
 DN  xx.xx.xx.19958.39 GB   256 50.0%
 6954754a-e9df-4b3c-aca7-146b938515d8  rac1
 DN  xx.xx.xx..61  33.79 GB   256 50.0%
 91b8d510-966a-4f2d-a666-d7edbe986a1c  rac1


 Thank you in advance,

 Daning










Re: Multiple data center performance

2013-06-11 Thread Daning Wang
It is the counters that caused the problem: counter writes replicate to all
replicas regardless of the consistency level.

In our case we don't need to sync the counters across datacenters, so moving
the counters to a new keyspace with all replicas in one datacenter solved the
problem.

There is a replicate_on_write option on the table. Turning it off for
counters might give better performance, but at a high risk of losing data and
creating inconsistency. I did not try this option.
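
For reference, a hedged sketch of flipping that option in 1.2-era CQL3
(hypothetical table name; as said above, doing this for counters risks data
loss and inconsistency):

cqlsh> ALTER TABLE dsat.page_hits WITH replicate_on_write = false;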

Daning


On Sat, Jun 8, 2013 at 6:53 AM, srmore comom...@gmail.com wrote:

 I am seeing similar behavior. In my case I have 2 nodes in each datacenter,
 and one node always has high latency (equal to the latency between the two
 datacenters). When one of the datacenters is shut down, the latency drops.

 I am curious to know whether anyone else has had these issues and, if so,
 how you got around it.

 Thanks !


 On Fri, Jun 7, 2013 at 11:49 PM, Daning Wang dan...@netseer.com wrote:

 We have deployed multi-datacenter but have a performance issue. When the
 nodes in the other datacenter are up, the read response time seen by clients
 is 4 or 5 times higher; when we take those nodes down, the response time
 becomes normal (comparable to before we changed to multi-datacenter).

 We have high volume on the cluster, and the consistency level is ONE for
 reads, so my understanding is that most of the traffic between the
 datacenters should be read repair; but it seems that should not create much
 delay.

 What could cause the problem? How do we debug this?

 Here is the keyspace:

 [default@dsat] describe dsat;
 Keyspace: dsat:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [dc2:1, dc1:3]
   Column Families:
 ColumnFamily: categorization_cache


 Ring

 Datacenter: dc1
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns (effective)  Host ID
 Rack
 UN  xx.xx.xx..111   59.2 GB256 37.5%
 4d6ed8d6-870d-4963-8844-08268607757e  rac1
 DN  xx.xx.xx..121   99.63 GB   256 37.5%
 9d0d56ce-baf6-4440-a233-ad6f1d564602  rac1
 UN  xx.xx.xx..120   66.32 GB   256 37.5%
 0fd912fb-3187-462b-8c8a-7d223751b649  rac1
 UN  xx.xx.xx..118   63.61 GB   256 37.5%
 3c6e6862-ab14-4a8c-9593-49631645349d  rac1
 UN  xx.xx.xx..117   68.16 GB   256 37.5%
 ee6cdf23-d5e4-4998-a2db-f6c0ce41035a  rac1
 UN  xx.xx.xx..116   32.41 GB   256 37.5%
 f783eeef-1c51-4f91-ab7c-a60669816770  rac1
 UN  xx.xx.xx..115   64.24 GB   256 37.5%
 e75105fb-b330-4f40-aa4f-8e6e11838e37  rac1
 UN  xx.xx.xx..112   61.32 GB   256 37.5%
 2547ee54-88dd-4994-a1ad-d9ba367ed11f  rac1
 Datacenter: dc2
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns (effective)  Host ID
 Rack
 DN  xx.xx.xx.19958.39 GB   256 50.0%
 6954754a-e9df-4b3c-aca7-146b938515d8  rac1
 DN  xx.xx.xx..61  33.79 GB   256 50.0%
 91b8d510-966a-4f2d-a666-d7edbe986a1c  rac1


 Thank you in advance,

 Daning





replication factor is zero

2013-06-06 Thread Daning Wang
We have a multi-datacenter deployment, and there are some tables whose data
we don't want to sync to the other datacenter. Could we set the replication
factor to 0 for the other datacenter? What is the best way to avoid syncing
some data in a cluster?
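
A hedged sketch of one common shape for this (hypothetical keyspace; 1.2-era
CQL3): rather than an explicit 0, simply omit the other datacenter from the
keyspace's replication options so nothing is replicated there:

cqlsh> CREATE KEYSPACE local_only
   ...   WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};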

Thanks in advance,

Daning


How to change existing cluster to multi-center

2013-04-25 Thread Daning Wang
Hi All,

We have an 8-node cluster (replication factor 3) with about 50 GB of data on
each node. We need to change the cluster to a multi-datacenter environment
(to EC2); the data needs to have one replica on EC2.

Here is the plan:

- Change the cluster config to multi-datacenter.
- Add 2 or 3 nodes in the other datacenter, which is EC2.
- Change the replication factor so the data syncs to the other datacenter (a
  sketch of this step follows below).

We have not tested this yet; is it doable? The main concern is that since the
connection to EC2 is slow, it will take a long time to stream the data (more
than 100 GB) at the beginning.
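
Not from the thread, but the usual shape of that last step (hypothetical
keyspace name; hedged): once the EC2 nodes have joined under a multi-DC
snitch, widen the replication and stream the existing data from the source DC:

cqlsh> ALTER KEYSPACE dsat
   ...   WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'ec2': 1};

# then, on each new EC2 node:
nodetool rebuild dc1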

If anybody has done this before, please shed some light.

Thanks in advance,

Daning


Cassandra remote backup solution

2013-04-25 Thread Daning Wang
Hi Guys,

What is the Cassandra solution for remote backup, besides a multi-datacenter
setup? I hope to do incremental backups to a remote data center.
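
A hedged sketch of the usual pieces (paths and hostnames are hypothetical):

# cassandra.yaml: hard-link a copy of each newly flushed SSTable under backups/
incremental_backups: true

# periodic full snapshot of a keyspace, per node
nodetool -h localhost snapshot dsat

# ship snapshots and incremental backups off-box
rsync -a /var/lib/cassandra/data/dsat/ backuphost:/backups/node1/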

Thanks,

Daning


Re: Upgrade to Cassandra 1.2

2013-02-14 Thread Daning Wang
Thanks Aaron and Manu.

Since we are using 1.1, there is no num_tokens parameter. When I upgrade to
1.2, should I set num_tokens=1 to start up, or can I set it to another number?

Daning




On Tue, Feb 12, 2013 at 3:45 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 num_tokens is only used at bootstrap

 I think it's also used in this case (already bootstrapped with num_tokens
 = 1 and now num_tokens > 1). Cassandra will split a node's current range
 into *num_tokens* parts, and there should be no change to the amount of the
 ring a node holds before shuffling.


 On Wed, Feb 13, 2013 at 3:12 AM, aaron morton aa...@thelastpickle.com wrote:

 Restore the settings for num_tokens and initial_token to what they were
 before you upgraded.
 They should not be changed just because you are upgrading to 1.2; they are
 used to enable virtual nodes, which are not necessary to run 1.2.

 Cheers


-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 13/02/2013, at 8:02 AM, Daning Wang dan...@netseer.com wrote:

 No, I did not run shuffle, since the upgrade was not successful.

 What do you mean by reverting the changes to num_tokens and initial_token?
 Set num_tokens=1? initial_token should be ignored since this is not a
 bootstrap, right?

 Thanks,

 Daning

 On Tue, Feb 12, 2013 at 10:52 AM, aaron morton aa...@thelastpickle.com wrote:

 Were you upgrading to 1.2 AND running the shuffle or just upgrading to
 1.2?

 If you have not run shuffle I would suggest reverting the changes to
 num_tokens and initial_token. This is a guess, because num_tokens is only
 used at bootstrap.

 Just get upgraded to 1.2 first, then do the shuffle when things are
 stable.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote:

 Thanks Aaron.

 I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed.

 - I followed http://www.datastax.com/docs/1.2/install/upgrading and merged
 cassandra.yaml, with the following parameters:

 num_tokens: 256
 #initial_token: 0

 initial_token is commented out; the current token should be obtained from
 the system schema.

 - I did a rolling upgrade. During the upgrade I got "Broken Pipe" errors
 from the nodes on the old version; is that normal?

 - After I upgraded 3 nodes (still 5 to go), I found it totally wrong: the
 first node upgraded owns 99.2% of the ring.

 [cassy@d5:/usr/local/cassy conf]$  ~/bin/nodetool -h localhost status
 Datacenter: datacenter1
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns   Host ID
   Rack
 DN  10.210.101.11745.01 GB   254 99.2%
  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
 UN  10.210.101.12045.43 GB   256 0.4%
 0fd912fb-3187-462b-8c8a-7d223751b649  rack1
 UN  10.210.101.11127.08 GB   256 0.4%
 bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1


 What was wrong? please help. I could provide more information if you
 need.

 Thanks,

 Daning



 On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote:

 There is a command line utility in 1.2 to shuffle the tokens…

 http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

 $ ./cassandra-shuffle --help
 Missing sub-command argument.
 Usage: shuffle [options] sub-command

 Sub-commands:
  create   Initialize a new shuffle operation
  ls   List pending relocations
  clearClear pending relocations
  en[able] Enable shuffling
  dis[able]Disable shuffling

 Options:
  -dc,  --only-dc   Apply only to named DC (create only)
  -tp,  --thrift-port   Thrift port number (Default: 9160)
  -p,   --port  JMX port number (Default: 7199)
  -tf,  --thrift-framed Enable framed transport for Thrift (Default:
 false)
  -en,  --and-enableImmediately enable shuffling (create only)
  -H,   --help  Print help information
  -h,   --host  JMX hostname or IP address (Default:
 localhost)
  -th,  --thrift-host   Thrift hostname or IP address (Default: JMX
 host)

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote:

 I'd like to upgrade from 1.1.6 to 1.2.1. One big feature in 1.2 is that a
 node can have multiple tokens, but there is only one token per node in
 1.1.6.

 How can I upgrade to 1.2.1 and then break up the token to take advantage
 of this feature? I went through this doc, but it does not say how to
 change num_tokens:

 http://www.datastax.com/docs/1.2/install/upgrading

 Is there other doc about this upgrade path?

 Thanks,

 Daning


 I think for each node you need to change the num_token option in
 conf/cassandra.yaml (this only splits the current range into num_token
 parts) and run the bin/cassandra-shuffle command (this spreads it all over
 the ring).

Re: Upgrade to Cassandra 1.2

2013-02-14 Thread Daning Wang
Thanks! Supposing I can upgrade to 1.2.x with 1 token by commenting out
num_tokens, how can I then change to multiple tokens? I could not find a doc
that clearly explains this.


On Thu, Feb 14, 2013 at 10:54 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

 From:
 http://www.datastax.com/docs/1.2/configuration/node_configuration#num-tokens

 About num_tokens: If left unspecified, Cassandra uses the default value
 of 1 token (for legacy compatibility) and uses the initial_token. If you
 already have a cluster with one token per node, and wish to migrate to
 multiple tokens per node.

 So I would leave num_tokens commented out in cassandra.yaml and would set
 the initial_token to the same value as in the pre-C*-1.2.x-upgrade
 configuration.

 Alain
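
Pulling the thread's advice together, a hedged sketch of the two-step path
(1.2-era options; shuffle per Aaron's earlier message - not a tested recipe):

# step 1: upgrade the binaries while keeping the old single-token identity
#num_tokens: 1                  # left unset/commented for legacy behavior
initial_token: <pre-upgrade token>

# step 2, once the whole cluster is stable on 1.2: split each node's range
num_tokens: 256
# restart, then redistribute the now-contiguous ranges
./cassandra-shuffle create -en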


 2013/2/14 Daning Wang dan...@netseer.com

 Thanks Aaron and Manu.

 Since we are using 1.1, there is no num_tokens parameter. When I upgrade
 to 1.2, should I set num_tokens=1 to start up, or can I set it to another
 number?

 Daning




 On Tue, Feb 12, 2013 at 3:45 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 num_tokens is only used at bootstrap

 I think it's also used in this case (already bootstrapped with
 num_tokens = 1 and now num_tokens > 1). Cassandra will split a node's
 current range into *num_tokens* parts, and there should be no change to the
 amount of the ring a node holds before shuffling.


 On Wed, Feb 13, 2013 at 3:12 AM, aaron morton aa...@thelastpickle.com wrote:

 Restore the settings for num_tokens and initial_token to what they were
 before you upgraded.
 They should not be changed just because you are upgrading to 1.2; they are
 used to enable virtual nodes, which are not necessary to run 1.2.

 Cheers


-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 13/02/2013, at 8:02 AM, Daning Wang dan...@netseer.com wrote:

 No, I did not run shuffle, since the upgrade was not successful.

 What do you mean by reverting the changes to num_tokens and initial_token?
 Set num_tokens=1? initial_token should be ignored since this is not a
 bootstrap, right?

 Thanks,

 Daning

 On Tue, Feb 12, 2013 at 10:52 AM, aaron morton aa...@thelastpickle.com wrote:

 Were you upgrading to 1.2 AND running the shuffle or just upgrading to
 1.2?

 If you have not run shuffle I would suggest reverting the changes to
 num_tokens and initial_token. This is a guess, because num_tokens is only
 used at bootstrap.

 Just get upgraded to 1.2 first, then do the shuffle when things are
 stable.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote:

 Thanks Aaron.

 I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed.

 - I followed http://www.datastax.com/docs/1.2/install/upgrading and merged
 cassandra.yaml, with the following parameters:

 num_tokens: 256
 #initial_token: 0

 initial_token is commented out; the current token should be obtained from
 the system schema.

 - I did a rolling upgrade. During the upgrade I got "Broken Pipe" errors
 from the nodes on the old version; is that normal?

 - After I upgraded 3 nodes (still 5 to go), I found it totally wrong: the
 first node upgraded owns 99.2% of the ring.

 [cassy@d5:/usr/local/cassy conf]$  ~/bin/nodetool -h localhost status
 Datacenter: datacenter1
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns   Host ID
 Rack
 DN  10.210.101.11745.01 GB   254 99.2%
  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
 UN  10.210.101.12045.43 GB   256 0.4%
 0fd912fb-3187-462b-8c8a-7d223751b649  rack1
 UN  10.210.101.11127.08 GB   256 0.4%
 bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1


 What was wrong? please help. I could provide more information if you
 need.

 Thanks,

 Daning



 On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote:

 There is a command line utility in 1.2 to shuffle the tokens…


 http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

 $ ./cassandra-shuffle --help
 Missing sub-command argument.
 Usage: shuffle [options] sub-command

 Sub-commands:
  create   Initialize a new shuffle operation
  ls   List pending relocations
  clearClear pending relocations
  en[able] Enable shuffling
  dis[able]Disable shuffling

 Options:
  -dc,  --only-dc   Apply only to named DC (create only)
  -tp,  --thrift-port   Thrift port number (Default: 9160)
  -p,   --port  JMX port number (Default: 7199)
  -tf,  --thrift-framed Enable framed transport for Thrift
 (Default: false)
  -en,  --and-enableImmediately enable shuffling (create only)
  -H,   --help  Print help information
  -h,   --host  JMX hostname or IP address (Default:
 localhost)
  -th,  --thrift-host   Thrift hostname or IP address (Default: JMX host)

Re: Upgrade to Cassandra 1.2

2013-02-12 Thread Daning Wang
No, I did not run shuffle, since the upgrade was not successful.

What do you mean by reverting the changes to num_tokens and initial_token?
Set num_tokens=1? initial_token should be ignored since this is not a
bootstrap, right?

Thanks,

Daning

On Tue, Feb 12, 2013 at 10:52 AM, aaron morton aa...@thelastpickle.com wrote:

 Were you upgrading to 1.2 AND running the shuffle or just upgrading to
 1.2?

 If you have not run shuffle I would suggest reverting the changes to
 num_tokens and initial_token. This is a guess, because num_tokens is only
 used at bootstrap.

 Just get upgraded to 1.2 first, then do the shuffle when things are
 stable.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 12/02/2013, at 2:55 PM, Daning Wang dan...@netseer.com wrote:

 Thanks Aaron.

 I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed.

 - I followed http://www.datastax.com/docs/1.2/install/upgrading and merged
 cassandra.yaml, with the following parameters:

 num_tokens: 256
 #initial_token: 0

 initial_token is commented out; the current token should be obtained from
 the system schema.

 - I did a rolling upgrade. During the upgrade I got "Broken Pipe" errors
 from the nodes on the old version; is that normal?

 - After I upgraded 3 nodes (still 5 to go), I found it totally wrong: the
 first node upgraded owns 99.2% of the ring.

 [cassy@d5:/usr/local/cassy conf]$  ~/bin/nodetool -h localhost status
 Datacenter: datacenter1
 ===
 Status=Up/Down
 |/ State=Normal/Leaving/Joining/Moving
 --  Address   Load   Tokens  Owns   Host ID
 Rack
 DN  10.210.101.11745.01 GB   254 99.2%
  f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
 UN  10.210.101.12045.43 GB   256 0.4%
 0fd912fb-3187-462b-8c8a-7d223751b649  rack1
 UN  10.210.101.11127.08 GB   256 0.4%
 bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1


 What was wrong? please help. I could provide more information if you need.

 Thanks,

 Daning



 On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote:

 There is a command line utility in 1.2 to shuffle the tokens…

 http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

 $ ./cassandra-shuffle --help
 Missing sub-command argument.
 Usage: shuffle [options] sub-command

 Sub-commands:
  create   Initialize a new shuffle operation
  ls   List pending relocations
  clearClear pending relocations
  en[able] Enable shuffling
  dis[able]Disable shuffling

 Options:
  -dc,  --only-dc   Apply only to named DC (create only)
  -tp,  --thrift-port   Thrift port number (Default: 9160)
  -p,   --port  JMX port number (Default: 7199)
  -tf,  --thrift-framed Enable framed transport for Thrift (Default:
 false)
  -en,  --and-enableImmediately enable shuffling (create only)
  -H,   --help  Print help information
  -h,   --host  JMX hostname or IP address (Default: localhost)
  -th,  --thrift-host   Thrift hostname or IP address (Default: JMX
 host)

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote:

 I'd like to upgrade from 1.1.6 to 1.2.1. One big feature in 1.2 is that a
 node can have multiple tokens, but there is only one token per node in
 1.1.6.

 How can I upgrade to 1.2.1 and then break up the token to take advantage
 of this feature? I went through this doc, but it does not say how to
 change num_tokens:

 http://www.datastax.com/docs/1.2/install/upgrading

 Is there other doc about this upgrade path?

 Thanks,

 Daning


 I think for each node you need to change the num_token option in
 conf/cassandra.yaml (this only splits the current range into num_token
 parts) and run the bin/cassandra-shuffle command (this spreads it all over
 the ring).







Re: Upgrade to Cassandra 1.2

2013-02-11 Thread Daning Wang
Thanks Aaron.

I tried to migrate an existing cluster (ver 1.1.0) to 1.2.1 but failed.

- I followed http://www.datastax.com/docs/1.2/install/upgrading and merged
cassandra.yaml, with the following parameters:

num_tokens: 256
#initial_token: 0

initial_token is commented out; the current token should be obtained from the
system schema.

- I did a rolling upgrade. During the upgrade I got "Broken Pipe" errors from
the nodes on the old version; is that normal?

- After I upgraded 3 nodes (still 5 to go), I found it totally wrong: the
first node upgraded owns 99.2% of the ring.

[cassy@d5:/usr/local/cassy conf]$  ~/bin/nodetool -h localhost status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address   Load   Tokens  Owns   Host ID
  Rack
DN  10.210.101.11745.01 GB   254 99.2%
 f4b6afe3-7e2e-4c61-96e8-12a529a31373  rack1
UN  10.210.101.12045.43 GB   256 0.4%
0fd912fb-3187-462b-8c8a-7d223751b649  rack1
UN  10.210.101.11127.08 GB   256 0.4%
bd4c37bc-07dd-488b-bfab-e74e32c26f6e  rack1


What was wrong? please help. I could provide more information if you need.

Thanks,

Daning



On Mon, Feb 4, 2013 at 9:16 AM, aaron morton aa...@thelastpickle.com wrote:

 There is a command line utility in 1.2 to shuffle the tokens…

 http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes

 $ ./cassandra-shuffle --help
 Missing sub-command argument.
 Usage: shuffle [options] sub-command

 Sub-commands:
  create   Initialize a new shuffle operation
  ls   List pending relocations
  clearClear pending relocations
  en[able] Enable shuffling
  dis[able]Disable shuffling

 Options:
  -dc,  --only-dc   Apply only to named DC (create only)
  -tp,  --thrift-port   Thrift port number (Default: 9160)
  -p,   --port  JMX port number (Default: 7199)
  -tf,  --thrift-framed Enable framed transport for Thrift (Default:
 false)
  -en,  --and-enableImmediately enable shuffling (create only)
  -H,   --help  Print help information
  -h,   --host  JMX hostname or IP address (Default: localhost)
  -th,  --thrift-host   Thrift hostname or IP address (Default: JMX
 host)

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 3/02/2013, at 11:32 PM, Manu Zhang owenzhang1...@gmail.com wrote:

 On Sun 03 Feb 2013 05:45:56 AM CST, Daning Wang wrote:

 I'd like to upgrade from 1.1.6 to 1.2.1. One big feature in 1.2 is that a
 node can have multiple tokens, but there is only one token per node in
 1.1.6.

 How can I upgrade to 1.2.1 and then break up the token to take advantage
 of this feature? I went through this doc, but it does not say how to
 change num_tokens:

 http://www.datastax.com/docs/1.2/install/upgrading

 Is there other doc about this upgrade path?

 Thanks,

 Daning


 I think for each node you need to change the num_token option in
 conf/cassandra.yaml (this only splits the current range into num_token
 parts) and run the bin/cassandra-shuffle command (this spreads it all over
 the ring).





Cassandra jmx stats ReadCount

2013-02-07 Thread Daning Wang
We have an 8-node cluster on Cassandra 1.1.0, with a replication factor of 3.
We found that when we just insert data, not only does WriteCount increase,
ReadCount also increases.

How can this happen? I was under the impression that ReadCount only counts
reads from clients.
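
Not from the original post, but a hedged way to see where those reads land
(1.1-era nodetool): watch the per-CF counters while only writes are running.

nodetool -h localhost cfstats

Internal reads - for example the pre-read a secondary-index update performs,
or the local read inside a counter write - can show up in these counters too,
which may explain ReadCount climbing without any client reads.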

Thanks,

Daning


Upgrade to Cassandra 1.2

2013-02-02 Thread Daning Wang
I'd like to upgrade from 1.1.6 to 1.2.1. One big feature in 1.2 is that a
node can have multiple tokens, but there is only one token per node in 1.1.6.

How can I upgrade to 1.2.1 and then break up the token to take advantage of
this feature? I went through this doc, but it does not say how to change
num_tokens:

http://www.datastax.com/docs/1.2/install/upgrading

 Is there other doc about this upgrade path?

Thanks,

Daning


Problem on node join the ring

2013-01-28 Thread Daning Wang
I added a new node to the ring (version 1.1.6). After more than 30 hours it
is still in the 'Joining' state:

Address DC  RackStatus State   Load
 Effective-Ownership Token

   141784319550391026443072753096570088105
10.28.78.123datacenter1 rack1   Up Normal  18.73 GB
 50.00%  0
10.4.17.138 datacenter1 rack1   Up Normal  15 GB
39.29%  24305883351495604533098186245126300818
10.93.95.51 datacenter1 rack1   Up Normal  17.96 GB
 41.67%  42535295865117307932921825928971026432
10.170.1.26 datacenter1 rack1   Up Joining 6.89 GB
0.00%   56713727820156410577229101238628035242
10.6.115.239datacenter1 rack1   Up Normal  20.3 GB
50.00%  85070591730234615865843651857942052864
10.28.20.200datacenter1 rack1   Up Normal  22.68 GB
 60.71%  127605887595351923798765477786913079296
10.240.113.171  datacenter1 rack1   Up Normal  18.4 GB
58.33%  141784319550391026443072753096570088105


After a while the CPU usage goes down to 0; it looks like it is stuck. I have
restarted the server several times in the last 30 hours. When the server has
just started you can see streaming in 'nodetool netstats', but after a few
minutes there is no streaming anymore.

I have turned on debug logging; this is what it is doing now (the CPU is
pretty much idle), with no error messages at all.
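
A hedged set of checks while the node sits in Joining (1.1-era nodetool):

nodetool -h localhost netstats          # any streams actually in flight?
nodetool -h localhost compactionstats   # pending builds blocking the join?
nodetool -h localhost tpstats           # stages with stuck Pending tasks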

Please help, I can provide more info if needed.

Thanks in advance,


DEBUG [MutationStage:17] 2013-01-28 12:47:59,618
RowMutationVerbHandler.java (line 44) Applying RowMutation(keyspace='dsat',
key='52f5298affbb8bf0', modifications=[ColumnFamily(dsatcache
[_meta:false:278@1359406079725000!3888000,])])
DEBUG [MutationStage:17] 2013-01-28 12:47:59,618 Table.java (line 395)
applying mutation of row 52f5298affbb8bf0
DEBUG [MutationStage:17] 2013-01-28 12:47:59,618
RowMutationVerbHandler.java (line 56) RowMutation(keyspace='dsat',
key='52f5298affbb8bf0', modifications=[ColumnFamily(dsatcache
[_meta:false:278@1359406079725000!3888000,])]) applied.  Sending response
to 571645593@/10.28.78.123
DEBUG [MutationStage:26] 2013-01-28 12:47:59,623
RowMutationVerbHandler.java (line 44) Applying RowMutation(keyspace='dsat',
key='57f700499922964b', modifications=[ColumnFamily(dsatcache
[cache_type:false:8@1359406079730002,path:false:30@1359406079730001
,top_node:false:22@135940607973,v0:false:976@1359406079730003
!3888000,])])
DEBUG [MutationStage:26] 2013-01-28 12:47:59,623 Table.java (line 395)
applying mutation of row 57f700499922964b
DEBUG [MutationStage:26] 2013-01-28 12:47:59,623 Table.java (line 429)
mutating indexed column top_node value
6d617474626f7574726f732e74756d626c722e636f6d
DEBUG [MutationStage:26] 2013-01-28 12:47:59,623 CollationController.java
(line 78) collectTimeOrderedData
DEBUG [MutationStage:26] 2013-01-28 12:47:59,623 Table.java (line 453)
Pre-mutation index row is null
DEBUG [MutationStage:26] 2013-01-28 12:47:59,624 KeysIndex.java (line 119)
applying index row mattboutros.tumblr.com in
ColumnFamily(dsatcache.dsatcache_top_node_idx
[57f700499922964b:false:0@135940607973,])
DEBUG [MutationStage:26] 2013-01-28 12:47:59,624
RowMutationVerbHandler.java (line 56) RowMutation(keyspace='dsat',
key='57f700499922964b', modifications=[ColumnFamily(dsatcache
[cache_type:false:8@1359406079730002,path:false:30@1359406079730001
,top_node:false:22@135940607973,v0:false:976@1359406079730003!3888000,])])
applied.  Sending response to 710680715@/10.28.20.200
DEBUG [MutationStage:22] 2013-01-28 12:47:59,624
RowMutationVerbHandler.java (line 44) Applying RowMutation(keyspace='dsat',
key='57f700499922964b', modifications=[ColumnFamily(dsatcache
[_meta:false:278@1359406079731000!3888000,])])
DEBUG [MutationStage:22] 2013-01-28 12:47:59,624 Table.java (line 395)
applying mutation of row 57f700499922964b
DEBUG [MutationStage:22] 2013-01-28 12:47:59,624
RowMutationVerbHandler.java (line 56) RowMutation(keyspace='dsat',
key='57f700499922964b', modifications=[ColumnFamily(dsatcache
[_meta:false:278@1359406079731000!3888000,])]) applied.  Sending response
to 710680719@/10.28.20.200
DEBUG [MutationStage:25] 2013-01-28 12:47:59,652
RowMutationVerbHandler.java (line 44) Applying RowMutation(keyspace='dsat',
key='2a50083d5332071f', modifications=[ColumnFamily(dsatcache
[cache_type:false:8@1359406079692002,path:false:26@1359406079692001
,top_node:false:18@1359406079692000,v0:false:583@1359406079692003
!3888000,])])
DEBUG [MutationStage:25] 2013-01-28 12:47:59,652 Table.java (line 395)
applying mutation of row 2a50083d5332071f
DEBUG [MutationStage:25] 2013-01-28 12:47:59,652 Table.java (line 429)
mutating indexed column top_node value 772e706163696669632d72652e636f6d
DEBUG [MutationStage:25] 2013-01-28 12:47:59,652 CollationController.java
(line 78) collectTimeOrderedData
DEBUG [MutationStage:25] 2013-01-28 12:47:59,652 Table.java (line 453)
Pre-mutation index row is null
DEBUG [MutationStage:25] 2013-01-28 

1.2 Authentication

2013-01-28 Thread Daning Wang
We were using SimpleAuthenticator on 1.1.x and it worked fine.

While testing 1.2, I put the classes under example/simple_authentication in a
jar and copied it to the lib directory; the class is loaded. However, when I
try to connect with the correct user/password, it gives me an error:
./cqlsh s2.dsat103-e1a -u  -p 
Traceback (most recent call last):
  File "./cqlsh", line 2262, in <module>
    main(*read_options(sys.argv[1:], os.environ))
  File "./cqlsh", line 2248, in main
    display_float_precision=options.float_precision)
  File "./cqlsh", line 483, in __init__
    cql_version=cqlver, transport=transport)
  File "./../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/connection.py", line 143, in connect
  File "./../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/connection.py", line 59, in __init__
  File "./../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/thrifteries.py", line 157, in establish_connection
  File "./../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cassandra/Cassandra.py", line 455, in login
  File "./../lib/cql-internal-only-1.4.0.zip/cql-1.4.0/cql/cassandra/Cassandra.py", line 476, in recv_login
cql.cassandra.ttypes.AuthenticationException: AuthenticationException(why=User  doesn't exist - create it with CREATE USER query first)


What does "create it with CREATE USER query first" mean?

I added debug output to the SimpleAuthenticator class, and it showed that
authentication passes in the authenticate() method.
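
For context, a hedged sketch of the 1.2 built-in auth path (as opposed to a
ported 1.1 SimpleAuthenticator): in cassandra.yaml,

authenticator: org.apache.cassandra.auth.PasswordAuthenticator

then log in as the default superuser and create accounts with CQL3
(hypothetical user name):

./cqlsh -u cassandra -p cassandra
cqlsh> CREATE USER dsat_user WITH PASSWORD 'secret' NOSUPERUSER;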

Thanks,

Daning


Re: Replication factor

2012-05-23 Thread Daning Wang
Thanks guys.

Aaron, I am confused about this. From the wiki
(http://wiki.apache.org/cassandra/ReadRepair), it looks like, for any
consistency level, read repair will be done either before or after responding
with data.

  Read Repair does not run at CL ONE

Daning

On Wed, May 23, 2012 at 3:51 AM, Viktor Jevdokimov 
viktor.jevdoki...@adform.com wrote:

   When RF == number of nodes, and you read at CL ONE you will always be
 reading locally.

 "Always be reading locally" - only if the dynamic snitch is off. With the
 dynamic snitch on, a request may be redirected to another node, which may
 introduce latency spikes.


Best regards / Pagarbiai
 Viktor Jevdokimov
 Senior Developer

 Email: viktor.jevdoki...@adform.com
 Phone: +370 5 212 3063, Fax +370 5 261 0453
 J. Jasinskio 16C, LT-01112 Vilnius, Lithuania
 Follow us on Twitter: @adforminsider http://twitter.com/#!/adforminsider
 What is Adform: watch this short video http://vimeo.com/adform/display
 Adform News: http://www.adform.com

 Disclaimer: The information contained in this message and attachments is
 intended solely for the attention and use of the named addressee and may be
 confidential. If you are not the intended recipient, you are reminded that
 the information remains the property of the sender. You must not use,
 disclose, distribute, copy, print or rely on this e-mail. If you have
 received this message in error, please contact the sender immediately and
 irrevocably delete this message and any copies.

   From: aaron morton [mailto:aa...@thelastpickle.com]
 Sent: Wednesday, May 23, 2012 13:00
 To: user@cassandra.apache.org
 Subject: Re: Replication factor

 RF is normally adjusted to modify availability (see
 http://thelastpickle.com/2011/06/13/Down-For-Me/)


 For example, if I have a 4-node cluster in one datacenter, how do RF=2 vs
 RF=4 affect read performance? If the consistency level is ONE, it looks like
 a read does not need to go another hop to get data if RF=4, but it would do
 more work on read repair in the background.

  Read Repair does not run at CL ONE.

 When RF == number of nodes, and you read at CL ONE you will always be
 reading locally. But with a low consistency.

 If you read with QUORUM when RF == number of nodes you will still get some
 performance benefit from the data being read locally.


 Cheers


 -

 Aaron Morton

 Freelance Developer

 @aaronmorton

 http://www.thelastpickle.com

 ** **

 On 23/05/2012, at 9:34 AM, Daning Wang wrote:

 Hello,

 What are the pros and cons of choosing different replication factors, in
 terms of performance, if space is not a concern?

 For example, if I have a 4-node cluster in one datacenter, how do RF=2 vs
 RF=4 affect read performance? If the consistency level is ONE, it looks like
 a read does not need to go another hop to get data if RF=4, but it would do
 more work on read repair in the background.

 Can you share some insights about this?

 Thanks in advance,

 Daning 



Re: Couldn't find cfId

2012-05-16 Thread Daning Wang
Thanks Aaron! We will upgrade to 1.0.9.

Just curious: you said to remove the HintedHandoff files from data/system.
What do those files look like?

Thanks,

Daning
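
Hedged illustration (1.0-era layout; generation numbers hypothetical): hints
live in the system keyspace's HintsColumnFamily, so with the node stopped the
files in question would look roughly like:

$ ls /var/lib/cassandra/data/system/
HintsColumnFamily-hc-12-Data.db
HintsColumnFamily-hc-12-Index.db
HintsColumnFamily-hc-12-Filter.db
HintsColumnFamily-hc-12-Statistics.db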

On Wed, May 16, 2012 at 2:32 AM, aaron morton aa...@thelastpickle.com wrote:

 Looks like this https://issues.apache.org/jira/browse/CASSANDRA-3975

 Fixed in the latest 1.0.9.

 Either upgrade (which is always a good idea) or purge the hints from the
 server. Either using JMX or stopping the node and removing the
 HintedHandoff files from data/system.

 In either case you should then run a nodetool repair as hints for other
 CF's may have been dropped.

 Cheers


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 16/05/2012, at 2:27 AM, Daning Wang wrote:

 We got the exception "UnserializableColumnFamilyException: Couldn't find
 cfId=1075" in the log of one node; describe cluster shows all the nodes on
 the same schema version. How do we fix this problem? We did a repair but it
 looks like it does not work; we haven't tried scrub yet.

 We are on v1.0.3

 ERROR [HintedHandoff:1631] 2012-05-15 07:13:07,877 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[HintedHandoff:1631,1,main]
 java.lang.RuntimeException: org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1075
         at org.apache.cassandra.utils.FBUtilities.unchecked(FBUtilities.java:689)
         at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
         at java.lang.Thread.run(Thread.java:722)
 Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1075
         at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:129)
         at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:401)
         at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:409)
         at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:344)
         at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:248)
         at org.apache.cassandra.db.HintedHandOffManager.access$200(HintedHandOffManager.java:84)
         at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:418)

 Thanks,

 Daning





Couldn't find cfId

2012-05-15 Thread Daning Wang
We got the exception "UnserializableColumnFamilyException: Couldn't find
cfId=1075" in the log of one node; describe cluster shows all the nodes on
the same schema version. How do we fix this problem? We did a repair but it
looks like it does not work; we haven't tried scrub yet.

We are on v1.0.3

ERROR [HintedHandoff:1631] 2012-05-15 07:13:07,877 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[HintedHandoff:1631,1,main]
java.lang.RuntimeException: org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1075
        at org.apache.cassandra.utils.FBUtilities.unchecked(FBUtilities.java:689)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find cfId=1075
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:129)
        at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:401)
        at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:409)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpointInternal(HintedHandOffManager.java:344)
        at org.apache.cassandra.db.HintedHandOffManager.deliverHintsToEndpoint(HintedHandOffManager.java:248)
        at org.apache.cassandra.db.HintedHandOffManager.access$200(HintedHandOffManager.java:84)
        at org.apache.cassandra.db.HintedHandOffManager$3.runMayThrow(HintedHandOffManager.java:418)

Thanks,

Daning


Re: Request timeout and host marked down

2012-04-10 Thread Daning Wang
Thanks Aaron, will seek help from hector team.

On Tue, Apr 10, 2012 at 3:41 AM, aaron morton aa...@thelastpickle.com wrote:

 Caused by: java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129)
 at
 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:127)
 ... 31 more

 This looks like a client side timeout to me.

 AFAIK it will use this:

 http://rantav.github.com/hector//source/content/API/core/1.0-1/me/prettyprint/cassandra/service/CassandraHost.html#getCassandraThriftSocketTimeout()

 if it's > 0, otherwise the value of the CASSANDRA_THRIFT_SOCKET_TIMEOUT JVM
 param, otherwise 0, I think.

 Hector is one of the many things I am not an expert on. Try the hector
 user list if you are still having problems.



 [cassy@s2.dsat4 ~]$  ~/bin/nodetool -h localhost tpstats
 Pool Name                    Active   Pending      Completed   Blocked  All time blocked
 ReadStage                         3         3      414129625         0                 0

 Looks fine.

 Hope that helps.


 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 10/04/2012, at 8:08 AM, Daning Wang wrote:

Thanks Aaron! Here is the exception. Is that a timeout between nodes? Is
there any parameter I can change to reduce the timeouts?

 me.prettyprint.hector.api.exceptions.HectorTransportException:
 org.apache.thrift.transport.TTransportException:
 java.net.SocketTimeoutException: Read timed out
 at
 me.prettyprint.cassandra.service.ExceptionsTranslatorImpl.translate(ExceptionsTranslatorImpl.java:33)
 at
 me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:130)
 at
 me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:100)
 at
 me.prettyprint.cassandra.service.Operation.executeAndSetResult(Operation.java:103)
 at
 me.prettyprint.cassandra.connection.HConnectionManager.operateWithFailover(HConnectionManager.java:246)
 at
 me.prettyprint.cassandra.model.ExecutingKeyspace.doExecuteOperation(ExecutingKeyspace.java:97)
 at
 me.prettyprint.cassandra.model.CqlQuery.execute(CqlQuery.java:99)
 at
 com.netseer.cassandra.cache.dao.CacheReader.getRows(CacheReader.java:267)
 at
 com.netseer.cassandra.cache.dao.CacheReader.getCache0(CacheReader.java:55)
 at
 com.netseer.cassandra.cache.dao.CacheDao.getCaches(CacheDao.java:85)
 at
 com.netseer.cassandra.cache.dao.CacheDao.getCache(CacheDao.java:71)
 at
 com.netseer.cassandra.cache.dao.CacheDao.getCache(CacheDao.java:149)
 at
 com.netseer.cassandra.cache.service.CacheServiceImpl.getCache(CacheServiceImpl.java:55)
 at
 com.netseer.cassandra.cache.service.CacheServiceImpl.getCache(CacheServiceImpl.java:28)
 at
 com.netseer.dsat.cache.CassandraDSATCacheImpl.get(CassandraDSATCacheImpl.java:62)
 at
 com.netseer.dsat.cache.CassandraDSATCacheImpl.getTimedValue(CassandraDSATCacheImpl.java:144)
 at
 com.netseer.dsat.serving.GenericCacheManager$4.call(GenericCacheManager.java:427)
 at
 com.netseer.dsat.serving.GenericCacheManager$4.call(GenericCacheManager.java:423)
 at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
 at java.util.concurrent.FutureTask.run(FutureTask.java:138)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: org.apache.thrift.transport.TTransportException:
 java.net.SocketTimeoutException: Read timed out
 at
 org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:129)
 at
 org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at
 org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:129)
 at
 org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
 at
 org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
 at
 org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
 at
 org.apache.cassandra.thrift.Cassandra$Client.recv_execute_cql_query(Cassandra.java:1698)
 at
 org.apache.cassandra.thrift.Cassandra$Client.execute_cql_query(Cassandra.java:1682)
 at
 me.prettyprint.cassandra.model.CqlQuery$1.execute(CqlQuery.java:106)
 ... 21 more
 Caused by: java.net.SocketTimeoutException: Read timed out
 at java.net.SocketInputStream.socketRead0(Native Method)
 at java.net.SocketInputStream.read(SocketInputStream.java:129

Re: Request timeout and host marked down

2012-04-09 Thread Daning Wang
 0  0
0 0
HintedHandoff 0 0   2746
0 0

Message type   Dropped
RANGE_SLICE  0
READ_REPAIR  17931
BINARY   0
READ   5185149
MUTATION232317
REQUEST_RESPONSE  1317






On Sun, Apr 8, 2012 at 2:15 PM, aaron morton aa...@thelastpickle.com wrote:

 You need to see if the timeout is from the client to the server, or
 between the server nodes.

 If it's server side a TimedOutException will be thrown from thrift. Take a
 look at the nodetool tpstats on the servers, you will probably see lots of
 Pending tasks. Basically the cluster is overloaded. Consider:

 * check the IO, CPU, GC state on the servers.
 * ensuring the data and requests are evenly spread around the cluster.
 * reducing the number of columns read in a select.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 6/04/2012, at 5:30 AM, Daning Wang wrote:

  Hi all,
 
  We are using Hector and often we see lots of timeout exceptions in the
 log. I know that Hector can fail over to another node, but I want to
 reduce the number of timeouts.
 
  Any Hector parameter I should change to reduce this error?
 
  Also, on the server side, is there any tuning needed for the timeouts?
 
 
  Thanks in advance.
 
 
  12/04/04 15:13:20 ERROR
 com.netseer.services.keywordstat.io.KeywordServiceImpl: Timout 1 ms
  12/04/04 15:13:25 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.28.78.123(10.28.78.123):9160
  12/04/04 15:13:25 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.28.78.123(10.28.78.123):9160};
 IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
  12/04/04 15:13:44 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.240.113.171(10.240.113.171):9160
  12/04/04 15:13:44 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.240.113.171(10.240.113.171):9160};
 IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
  12/04/04 15:13:46 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.28.78.123(10.28.78.123):9160
  12/04/04 15:13:46 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.28.78.123(10.28.78.123):9160};
 IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
  12/04/04 15:13:46 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.123.83.114(10.123.83.114):9160
  12/04/04 15:13:46 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.123.83.114(10.123.83.114):9160};
 IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
  12/04/04 15:13:46 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.6.115.239(10.6.115.239):9160
  12/04/04 15:13:46 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.6.115.239(10.6.115.239):9160};
 IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
  12/04/04 15:13:49 ERROR
 com.netseer.services.keywordstat.io.KeywordServiceImpl: Timout 1 ms
  12/04/04 15:13:49 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.120.205.48(10.120.205.48):9160
  12/04/04 15:13:49 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.120.205.48(10.120.205.48):9160};
 IsActive?: true; Active: 3; Blocked: 0; Idle: 3; NumBeforeExhausted: 17
  12/04/04 15:13:50 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
 TRIGGERED for host 10.28.20.200(10.28.20.200):9160
  12/04/04 15:13:50 ERROR
 me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
 shutdown:
 ConcurrentCassandraClientPoolByHost:{10.28.20.200(10.28.20.200):9160};
 IsActive?: true; Active: 2; Blocked: 0; Idle: 4; NumBeforeExhausted: 18
  12/04/04 15:13:51 ERROR
 com.netseer.services.keywordstat.io.KeywordServiceImpl: Timout 1 ms




Request timeout and host marked down

2012-04-05 Thread Daning Wang
Hi all,

We are using Hector, and often we see lots of timeout exceptions in the log.
I know that Hector can fail over to other nodes, but I want to reduce the
number of timeouts.

Are there any Hector parameters I should change to reduce these errors?

Also, on the server side, is there any tuning needed for the timeouts?


Thanks in advance.


12/04/04 15:13:20 ERROR
com.netseer.services.keywordstat.io.KeywordServiceImpl: Timout 1 ms
12/04/04 15:13:25 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.28.78.123(10.28.78.123):9160
12/04/04 15:13:25 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.28.78.123(10.28.78.123):9160};
IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
12/04/04 15:13:44 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.240.113.171(10.240.113.171):9160
12/04/04 15:13:44 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.240.113.171(10.240.113.171):9160};
IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
12/04/04 15:13:46 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.28.78.123(10.28.78.123):9160
12/04/04 15:13:46 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.28.78.123(10.28.78.123):9160};
IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
12/04/04 15:13:46 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.123.83.114(10.123.83.114):9160
12/04/04 15:13:46 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.123.83.114(10.123.83.114):9160};
IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
12/04/04 15:13:46 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.6.115.239(10.6.115.239):9160
12/04/04 15:13:46 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.6.115.239(10.6.115.239):9160};
IsActive?: true; Active: 1; Blocked: 0; Idle: 5; NumBeforeExhausted: 19
12/04/04 15:13:49 ERROR
com.netseer.services.keywordstat.io.KeywordServiceImpl: Timout 1 ms
12/04/04 15:13:49 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.120.205.48(10.120.205.48):9160
12/04/04 15:13:49 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.120.205.48(10.120.205.48):9160};
IsActive?: true; Active: 3; Blocked: 0; Idle: 3; NumBeforeExhausted: 17
12/04/04 15:13:50 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: MARK HOST AS DOWN
TRIGGERED for host 10.28.20.200(10.28.20.200):9160
12/04/04 15:13:50 ERROR
me.prettyprint.cassandra.connection.HConnectionManager: Pool state on
shutdown:
ConcurrentCassandraClientPoolByHost:{10.28.20.200(10.28.20.200):9160};
IsActive?: true; Active: 2; Blocked: 0; Idle: 4; NumBeforeExhausted: 18
12/04/04 15:13:51 ERROR
com.netseer.services.keywordstat.io.KeywordServiceImpl: Timout 1 ms
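
For reference, the client-side knobs involved here live on Hector's
CassandraHostConfigurator; a minimal sketch of tightening the timeouts so
failover kicks in sooner (the host list and values are hypothetical, and the
method names are from the Hector 0.8-era API as best I recall):

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorTimeoutConfig {
    public static void main(String[] args) {
        // Hypothetical host list; use your own ring.
        CassandraHostConfigurator config =
                new CassandraHostConfigurator("10.28.78.123:9160,10.28.20.200:9160");
        // Thrift socket timeout in ms: slow requests fail fast so Hector can
        // fail over to the next host instead of hanging.
        config.setCassandraThriftSocketTimeout(5000);
        // How long a caller waits for a pooled connection before giving up.
        config.setMaxWaitTimeWhenExhausted(2000);
        // Pool size per host; raise it if NumBeforeExhausted keeps dropping.
        config.setMaxActive(20);
        Cluster cluster = HFactory.getOrCreateCluster("Production Cluster", config);
        System.out.println("Connected to " + cluster.getName());
    }
}

On the server side, the matching knob is rpc_timeout_in_ms in cassandra.yaml.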


Re: Cassandra Exception

2012-03-28 Thread Daning Wang
We upgraded to 1.0.8, and it looks like the problem is gone.

Thanks for your help,

Daning

On Sun, Mar 25, 2012 at 9:54 AM, aaron morton aa...@thelastpickle.com wrote:

 Can you go to those nodes and run describe cluster? Also check the logs
 on the machines that are marked as UNREACHABLE.

 A node will be marked as UNREACHABLE if it is DOWN or if it did not
 respond in time.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 23/03/2012, at 11:29 AM, Daning Wang wrote:

 Thanks Aaron. When I do describe cluster, there are always UNREACHABLE
 nodes, but nodetool ring is fine. It is a pretty busy cluster, reading 3K/sec.

 $ cassandra-cli -h localhost -u root -pw cassy
 Connected to: Production Cluster on localhost/9160
 Welcome to the Cassandra CLI.

 Type 'help;' or '?' for help.
 Type 'quit;' or 'exit;' to quit.

 [root@unknown] describe cluster;
 Cluster Information:
Snitch: org.apache.cassandra.locator.SimpleSnitch
Partitioner: org.apache.cassandra.dht.RandomPartitioner
Schema versions:
 UNREACHABLE: [10.218.17.208, 10.123.83.114, 10.120.205.48,
 10.240.113.171]
 e331e720-4844-11e1--d808570c0dfd: [10.28.78.123, 10.28.20.200,
 10.6.115.239]
 [root@unknown]

 $ nodetool -h localhost ring
 Address DC  RackStatus State   Load
 OwnsToken

 141784319550391026443072753096570088105
 10.28.78.123datacenter1 rack1   Up Normal  5.46 GB
 16.67%  0
 10.120.205.48   datacenter1 rack1   Up Normal  5.49 GB
 16.67%  28356863910078205288614550619314017621
 10.6.115.239datacenter1 rack1   Up Normal  5.53 GB
 16.67%  56713727820156410577229101238628035242
 10.28.20.200datacenter1 rack1   Up Normal  5.51 GB
 16.67%  85070591730234615865843651857942052863
 10.123.83.114   datacenter1 rack1   Up Normal  5.49 GB
 16.67%  113427455640312821154458202477256070484
 10.240.113.171  datacenter1 rack1   Up Normal  5.43 GB
 16.67%  141784319550391026443072753096570088105


 Daning


 On Thu, Mar 22, 2012 at 1:47 PM, aaron morton aa...@thelastpickle.com wrote:

  java.io.IOError:
 org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
 cfId=-387130991

 Schema may have diverged between nodes.

 use cassandra-cli and run describe cluster; to see how many schema
 versions you have.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 22/03/2012, at 6:27 AM, Daning Wang wrote:

 and we are on 0.8.6.



 On Wed, Mar 21, 2012 at 10:24 AM, Daning Wang dan...@netseer.com wrote:

 Hi All,


 We got lots of exceptions in the log, and later the server crashed. Any
 idea what is happening and how to fix it?

 ERROR [RequestResponseStage:4] 2012-03-21 04:16:30,482
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[RequestResponseStage:4,5,main]
 java.io.IOError: java.io.EOFException
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
 at
 org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:125)
 at
 org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:197)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:104)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
 ... 6 more
 ERROR [RequestResponseStage:2] 2012-03-21 04:16:30,480
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[RequestResponseStage:2,5,main]
 java.io.IOError:
 org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
 cfId=-387130991
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
 at
 org.apache.cassandra.service.AsyncRepairCallback.response(AsyncRepairCallback.java:47)
 at
 org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException:
 Couldn't find cfId=-387130991

How to find CF from cfId

2012-03-22 Thread Daning Wang
Hi,

How do I find a column family from a cfId? I got a bunch of exceptions and
want to find out which CF has the problem.

java.io.IOError:
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
cfId=1744830464
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
at
org.apache.cassandra.service.AsyncRepairCallback.response(AsyncRepairCallback.java:47)
at
org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=1744830464
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:123)
at org.apache.cassandra.db.RowSerializer.deserialize(Row.java:69)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:113)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)


Daning
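
One way to build that mapping, since CfDef in the 0.8 Thrift interface
carries the numeric id: walk describe_keyspaces() and print each column
family's id. A minimal sketch (host and port are hypothetical; error
handling and authentication omitted). Note that if the cfId came from a
diverged schema, it may not appear in this list at all:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class FindCfById {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        // Print every keyspace.CF pair with its numeric cfId.
        for (KsDef ks : client.describe_keyspaces()) {
            for (CfDef cf : ks.getCf_defs()) {
                System.out.println(cf.getId() + " -> " + ks.getName() + "." + cf.getName());
            }
        }
        transport.close();
    }
}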


Re: Cassandra Exception

2012-03-22 Thread Daning Wang
Thanks Aaron. When I do describe cluster, there are always UNREACHABLE
nodes, but nodetool ring is fine. It is a pretty busy cluster, reading 3K/sec.

$ cassandra-cli -h localhost -u root -pw cassy
Connected to: Production Cluster on localhost/9160
Welcome to the Cassandra CLI.

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[root@unknown] describe cluster;
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
UNREACHABLE: [10.218.17.208, 10.123.83.114, 10.120.205.48,
10.240.113.171]
e331e720-4844-11e1--d808570c0dfd: [10.28.78.123, 10.28.20.200,
10.6.115.239]
[root@unknown]

$ nodetool -h localhost ring
Address DC  RackStatus State   Load
OwnsToken

141784319550391026443072753096570088105
10.28.78.123datacenter1 rack1   Up Normal  5.46 GB
16.67%  0
10.120.205.48   datacenter1 rack1   Up Normal  5.49 GB
16.67%  28356863910078205288614550619314017621
10.6.115.239datacenter1 rack1   Up Normal  5.53 GB
16.67%  56713727820156410577229101238628035242
10.28.20.200datacenter1 rack1   Up Normal  5.51 GB
16.67%  85070591730234615865843651857942052863
10.123.83.114   datacenter1 rack1   Up Normal  5.49 GB
16.67%  113427455640312821154458202477256070484
10.240.113.171  datacenter1 rack1   Up Normal  5.43 GB
16.67%  141784319550391026443072753096570088105


Daning


On Thu, Mar 22, 2012 at 1:47 PM, aaron morton aa...@thelastpickle.com wrote:

 java.io.IOError:
 org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
 cfId=-387130991

 Schema may have diverged between nodes.

 use cassandra-cli and run describe cluster; to see how many schema
 versions you have.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 22/03/2012, at 6:27 AM, Daning Wang wrote:

 and we are on 0.8.6.



 On Wed, Mar 21, 2012 at 10:24 AM, Daning Wang dan...@netseer.com wrote:

 Hi All,


 We got lots of exceptions in the log, and later the server crashed. Any
 idea what is happening and how to fix it?

 ERROR [RequestResponseStage:4] 2012-03-21 04:16:30,482
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[RequestResponseStage:4,5,main]
 java.io.IOError: java.io.EOFException
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
 at
 org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:125)
 at
 org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:197)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:104)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
 ... 6 more
 ERROR [RequestResponseStage:2] 2012-03-21 04:16:30,480
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[RequestResponseStage:2,5,main]
 java.io.IOError:
 org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
 cfId=-387130991
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
 at
 org.apache.cassandra.service.AsyncRepairCallback.response(AsyncRepairCallback.java:47)
 at
 org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException:
 Couldn't find cfId=-387130991
 at
 org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:123)
 at org.apache.cassandra.db.RowSerializer.deserialize(Row.java:69)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:113)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64



 This is the exception before the server crashes.


 ERROR [ReadRepairStage:299] 2012-03-21 05:02:53,808

Cassandra Exception

2012-03-21 Thread Daning Wang
Hi All,


We got lots of exceptions in the log, and later the server crashed. Any idea
what is happening and how to fix it?

ERROR [RequestResponseStage:4] 2012-03-21 04:16:30,482
AbstractCassandraDaemon.java (line 139) Fatal exception in thread
Thread[RequestResponseStage:4,5,main]
java.io.IOError: java.io.EOFException
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
at
org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:125)
at
org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.EOFException
at java.io.DataInputStream.readFully(DataInputStream.java:197)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:104)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
... 6 more
ERROR [RequestResponseStage:2] 2012-03-21 04:16:30,480
AbstractCassandraDaemon.java (line 139) Fatal exception in thread
Thread[RequestResponseStage:2,5,main]
java.io.IOError:
org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
cfId=-387130991
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
at
org.apache.cassandra.service.AsyncRepairCallback.response(AsyncRepairCallback.java:47)
at
org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
at
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException:
Couldn't find cfId=-387130991
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:123)
at org.apache.cassandra.db.RowSerializer.deserialize(Row.java:69)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:113)
at
org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
at
org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64



This is the exception before the server crashes.


ERROR [ReadRepairStage:299] 2012-03-21 05:02:53,808
AbstractCassandraDaemon.java (line 139) Fatal exception in thread
Thread[ReadRepairStage:299,5,main]
java.lang.RuntimeException:
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has
shut down
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
Caused by: java.util.concurrent.RejectedExecutionException:
ThreadPoolExecutor has shut down
at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:816)
at
java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1337)
at
org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
at
org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:388)
at
org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:346)
at
org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:121)
at
org.apache.cassandra.service.RowRepairResolver.resolve(RowRepairResolver.java:85)
at
org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:54)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
... 3 more

Thank you in advance,

Daning


Re: Cassandra Exception

2012-03-21 Thread Daning Wang
and we are on 0.8.6.



On Wed, Mar 21, 2012 at 10:24 AM, Daning Wang dan...@netseer.com wrote:

 Hi All,


 We got lots of exceptions in the log, and later the server crashed. Any
 idea what is happening and how to fix it?

 ERROR [RequestResponseStage:4] 2012-03-21 04:16:30,482
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[RequestResponseStage:4,5,main]
 java.io.IOError: java.io.EOFException
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
 at
 org.apache.cassandra.service.ReadCallback.response(ReadCallback.java:125)
 at
 org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.io.EOFException
 at java.io.DataInputStream.readFully(DataInputStream.java:197)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:104)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64)
 ... 6 more
 ERROR [RequestResponseStage:2] 2012-03-21 04:16:30,480
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[RequestResponseStage:2,5,main]
 java.io.IOError:
 org.apache.cassandra.db.UnserializableColumnFamilyException: Couldn't find
 cfId=-387130991
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:71)
 at
 org.apache.cassandra.service.AsyncRepairCallback.response(AsyncRepairCallback.java:47)
 at
 org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:49)
 at
 org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: org.apache.cassandra.db.UnserializableColumnFamilyException:
 Couldn't find cfId=-387130991
 at
 org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:123)
 at org.apache.cassandra.db.RowSerializer.deserialize(Row.java:69)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:113)
 at
 org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:82)
 at
 org.apache.cassandra.service.AbstractRowResolver.preprocess(AbstractRowResolver.java:64



 This is the exception before the server crashes.


 ERROR [ReadRepairStage:299] 2012-03-21 05:02:53,808
 AbstractCassandraDaemon.java (line 139) Fatal exception in thread
 Thread[ReadRepairStage:299,5,main]
 java.lang.RuntimeException:
 java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has
 shut down
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)
 Caused by: java.util.concurrent.RejectedExecutionException:
 ThreadPoolExecutor has shut down
 at
 org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:60)
 at
 java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:816)
 at
 java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1337)
 at
 org.apache.cassandra.net.MessagingService.receive(MessagingService.java:490)
 at
 org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:388)
 at
 org.apache.cassandra.net.MessagingService.sendOneWay(MessagingService.java:346)
 at
 org.apache.cassandra.service.RowRepairResolver.maybeScheduleRepairs(RowRepairResolver.java:121)
 at
 org.apache.cassandra.service.RowRepairResolver.resolve(RowRepairResolver.java:85)
 at
 org.apache.cassandra.service.AsyncRepairCallback$1.runMayThrow(AsyncRepairCallback.java:54)
 at
 org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
 ... 3 more

 Thank you in advance,

 Daning



hector connection pool

2012-03-05 Thread Daning Wang
I just got this error: All host pools marked down. Retry burden pushed
out to client. in a few clients recently. The clients could not recover; we
had to restart the client applications. We are using Hector 0.8.0.3.

At that time we were running compaction on a CF, which took several hours,
and the server was busy. But I would expect the client to recover after the
server load went down.

Has any bug been reported about this? I searched but could not find one.

Thanks,

Daning
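
For what it's worth, whether downed hosts are ever retried is also a
CassandraHostConfigurator setting; a hedged sketch (host names and values
hypothetical, from the Hector 0.8-era API as best I recall) of letting the
pool recover on its own once the server is responsive again:

import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorHostRetry {
    public static void main(String[] args) {
        // Hypothetical host list.
        CassandraHostConfigurator config =
                new CassandraHostConfigurator("host1:9160,host2:9160");
        // Periodically re-check hosts that were marked down and return them
        // to the pool once they respond, instead of leaving "All host pools
        // marked down" permanent until a client restart.
        config.setRetryDownedHosts(true);
        config.setRetryDownedHostsDelayInSeconds(10);
        Cluster cluster = HFactory.getOrCreateCluster("Test Cluster", config);
        System.out.println(cluster.describeClusterName());
    }
}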


Re: Rebalance cluster

2012-01-12 Thread Daning Wang
Thank you guys, much appreciated.

How about just pulling the slow machines out of the cluster? I think most
of the reads should already come from the fast machines because of the
dynamic snitch, so removing two machines should not add much load to the
remaining nodes.

What do you think?

Thanks,

Daning

On Wed, Jan 11, 2012 at 1:34 PM, Antonio Martinez antyp...@gmail.com wrote:

 There is another possible approach that I reference from the original
 Dynamo paper. Instead of trying to manage a heterogeneous cluster at the
 cassandra level, it might be possible to take the approach Amazon took.
 Find the smallest common denominator of resources for your nodes (most
 likely your smallest node) and virtualize the others to that level. For example,
 say you have 3 physical computers, one with one processor and 2gb of
 memory, one with 2 processors and 4gb, and one with 4 and 8gb. You could
 make the smallest one your basic block and then put two one processor 2gb
 vm's on the second machine and 4 of those on the third and largest machine.
 Then instead of managing the three of them separately and worrying about
 them being different you instead manage a ring of 7 equal nodes with equal
 portions of the ring. This allows you to give smaller machines a lesser
 load compared to the more powerful ones. The amazon paper on dynamo has
 more information on how they did it and some of the tricks they use for
 reliability.
 http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

 Hope this helps somewhat

 On Wed, Jan 11, 2012 at 2:00 PM, aaron morton aa...@thelastpickle.com wrote:

 I have good news and bad.

 The good news is I have a nice coffee. The bad news is it's pretty
 difficult to have some nodes with less load.

 In a cluster with 5 nodes and RF 3 each node holds the following token
 ranges.

 node1: node 1, 5 and 4
 node 2: node 2, 1, 5
 node 3: node 3, 2, 1
 node 4: node 4, 3, 2
 node 5: node 5, 4, 3

 The load on each node is its own token range plus those of the preceding
 RF-1 nodes. E.g. in a balanced ring of 5 nodes with RF 3, each node has 20%
 of the token ring and 60% of the total load.

 If the token ring is split like the below, each node has the total load
 shown after the /:

 node 1: 12.5 %  / 50%
 node 2: 25 % / 62.5%
 node 3:  25 % / 62.5%
 node 4: 12.5 % / 62.5%
 node 5: 25% / 62.5 %

 Only node 1 gets a small amount less. Try a different approach…

 node 1: 12.5 %  / 62.5%
 node 2: 12.5 % / 50%
 node 3: 25 % / 50%
 node 4: 25 % / 62.5%
 node 5: 25 % / 75.5 %

 That's even worse.

 David is right to use nodetool move. It's a good idea to update the
 initial tokens in the yaml (or your ops config) after the fact even though
 they are not used.

 Hope that helps.

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 12/01/2012, at 8:41 AM, David McNelis wrote:

 Daning,

 You can see how to do this basic sort of thing on the Wiki's operations
 page ( http://wiki.apache.org/cassandra/Operations )

 In short, you'll want to run:
 nodetool -h hostname move newtoken

 Then, once you've updated each of the tokens that you want to move,
 you'll want to run
 nodetool -h hostname cleanup

 That will remove the no-longer-necessary data from your smaller
 machines.

 Please note that someone else may have some better insights than I into
 whether or not your strategy is going to be effective. On the surface I
 think what you are doing is logical, but I'm unsure of the actual
 performance gains you'll see.

 David

 On Wed, Jan 11, 2012 at 1:32 PM, Daning Wang dan...@netseer.com wrote:

 Hi All,

 We have a 5-node cluster (on 0.8.6), but two machines are slower and have
 less memory, so performance was not good on those two machines under
 large-volume traffic. I want to move some data from the slower machines to
 the faster machines to ease the load, so the token ring will not be equally
 balanced.

 I am thinking of the following steps:

 1. modify cassandra.yaml to change the initial token.
 2. restart Cassandra (don't need to auto-bootstrap, right?)
 3. then run nodetool repair (or nodetool move? not sure which one to
 use)


 Is there any doc that has detailed steps about how to do this?

 Thanks in advance,

 Daning






 --
 Antonio Perez de Tejada Martinez




Rebalance cluster

2012-01-11 Thread Daning Wang
Hi All,

We have a 5-node cluster (on 0.8.6), but two machines are slower and have
less memory, so performance was not good on those two machines under
large-volume traffic. I want to move some data from the slower machines to
the faster machines to ease the load, so the token ring will not be equally
balanced.

I am thinking of the following steps:

1. modify cassandra.yaml to change the initial token.
2. restart Cassandra (don't need to auto-bootstrap, right?)
3. then run nodetool repair (or nodetool move? not sure which one to use)


Is there any doc that has detailed steps about how to do this?

Thanks in advance,

Daning


Pending on ReadStage

2012-01-06 Thread Daning Wang
Hi all,

We have a 5-node cluster (0.8.6), but the performance of one node is way
behind the others. I checked tpstats, and it always shows a non-zero pending
ReadStage; I don't see this problem on the other nodes.

What causes this problem? I/O? Memory? CPU usage is still low. How do I fix
it?

~/bin/nodetool -h localhost tpstats
Pool NameActive   Pending  Completed   Blocked  All
time blocked
ReadStage1115  56960
0 0
RequestResponseStage  0 0 606695
0 0
MutationStage 0 0 538634
0 0
ReadRepairStage   0 0 17
0 0
ReplicateOnWriteStage 0 0  0
0 0
GossipStage   0 0   5734
0 0
AntiEntropyStage  0 0  0
0 0
MigrationStage0 0  0
0 0
MemtablePostFlusher   0 0  7
0 0
StreamStage   0 0  0
0 0
FlushWriter   0 0  8
0 0
MiscStage 0 0  0
0 0
FlushSorter   0 0  0
0 0
InternalResponseStage 0 0  0
0 0
HintedHandoff 1 4  0
0 0

Message type   Dropped
RANGE_SLICE  0
READ_REPAIR  0
BINARY   0
READ  9082
MUTATION 0
REQUEST_RESPONSE 0

Thanks you in advance.

Daning


Re: Pending on ReadStage

2012-01-06 Thread Daning Wang
Thanks for your reply.

Nodes are equally balanced, and it is RandomPartitioner. I think that
machine is slower. Are you saying it is an I/O issue?

Daning

On Fri, Jan 6, 2012 at 10:25 AM, Mohit Anchlia mohitanch...@gmail.com wrote:

 Are all your nodes equally balanced in terms of read requests? Are you
 using RandomPartitioner? Are you reading using indexes?

 The first thing you can do is compare iostat -x output between the 2 nodes
 to rule out any I/O issues, assuming your read requests are equally
 balanced.

 On Fri, Jan 6, 2012 at 10:11 AM, Daning Wang dan...@netseer.com wrote:
  Hi all,
 
  We have a 5-node cluster (0.8.6), but the performance of one node is way
  behind the others. I checked tpstats, and it always shows a non-zero
  pending ReadStage; I don't see this problem on the other nodes.
 
  What causes this problem? I/O? Memory? CPU usage is still low. How do I
  fix it?
 
  ~/bin/nodetool -h localhost tpstats
  Pool NameActive   Pending  Completed   Blocked
 All
  time blocked
  ReadStage1115  56960
  0 0
  RequestResponseStage  0 0 606695
  0 0
  MutationStage 0 0 538634
  0 0
  ReadRepairStage   0 0 17
  0 0
  ReplicateOnWriteStage 0 0  0
  0 0
  GossipStage   0 0   5734
  0 0
  AntiEntropyStage  0 0  0
  0 0
  MigrationStage0 0  0
  0 0
  MemtablePostFlusher   0 0  7
  0 0
  StreamStage   0 0  0
  0 0
  FlushWriter   0 0  8
  0 0
  MiscStage 0 0  0
  0 0
  FlushSorter   0 0  0
  0 0
  InternalResponseStage 0 0  0
  0 0
  HintedHandoff 1 4  0
  0 0
 
  Message type   Dropped
  RANGE_SLICE  0
  READ_REPAIR  0
  BINARY   0
  READ  9082
  MUTATION 0
  REQUEST_RESPONSE 0
 
  Thanks you in advance.
 
  Daning
 



Cassandra memory usage

2012-01-03 Thread Daning Wang
I have a Cassandra server with JVM settings -Xms4G -Xmx4G, but why does top
report 15G RES and 11G SHR memory usage? I understand that -Xmx4G is only
the heap size, but it is strange that the OS reports several times that much
memory usage. Is a lot of memory used by JNI? Please help explain this.

 cassy 2549 39.7 66.1 163805536 16324648 ?  Sl   Jan02 338:48
/usr/local/cassy/java/current/bin/java -ea
-javaagent:./../lib/jamm-0.2.2.jar -XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42 -Xms4G -Xmx4G
-Xmn1G -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=10 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true
-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -Dmx4jport=8085
-Djava.rmi.server.hostname=10.210.101.106
-Dlog4j.configuration=log4j-server.properties
-Dlog4j.defaultInitOverride=true
-Dpasswd.properties=./../conf/passwd.properties -cp
./../conf:./../build/classes/main:./../build/classes/thrift:./../lib/antlr-3.2.jar:./../lib/apache-cassandra-0.8.6.jar:./../lib/apache-cassandra-thrift-0.8.6.jar:./../lib/avro-1.4.0-fixes.jar:./../lib/avro-1.4.0-sources-fixes.jar:./../lib/commons-cli-1.1.jar:./../lib/commons-codec-1.2.jar:./../lib/commons-collections-3.2.1.jar:./../lib/commons-lang-2.4.jar:./../lib/concurrentlinkedhashmap-lru-1.1.jar:./../lib/guava-r08.jar:./../lib/high-scale-lib-1.1.2.jar:./../lib/jackson-core-asl-1.4.0.jar:./../lib/jackson-mapper-asl-1.4.0.jar:./../lib/jamm-0.2.2.jar:./../lib/jline-0.9.94.jar:./../lib/jna.jar:./../lib/json-simple-1.1.jar:./../lib/libthrift-0.6.jar:./../lib/log4j-1.2.16.jar:./../lib/mx4j-tools.jar:./../lib/servlet-api-2.5-20081211.jar:./../lib/slf4j-api-1.6.1.jar:./../lib/slf4j-log4j12-1.6.1.jar:./../lib/snakeyaml-1.6.jar
org.apache.cassandra.thrift.CassandraDaemon


Top

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND

  2549 cassy 21   0  156g  15g  11g S 66.9 65.5 338:02.72 java


Thank you in advance,


Daning


TimedOutException()

2012-01-03 Thread Daning Wang
Hi All,

We are getting TimedOutException() when inserting data into Cassandra. It
was working fine for a few months, but suddenly we got this problem. I have
increased rpc_timeout_in_ms to 3, but it still timed out in 30 secs.

I turned on debug and saw many of these errors in the log:

DEBUG [pool-2-thread-420] 2012-01-03 15:25:43,689
CustomTThreadPoolServer.java (line 197) Thrift transport error occurred
during processing of message.
org.apache.thrift.transport.TTransportException: Cannot read. Remote side
has closed. Tried to read 4 bytes, but only got 0 bytes. (This is often
indicative of an internal error on the server side. Please check your
server logs.)
at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:297)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:204)
at
org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2877)
at
org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


We are on 0.8.6. Any idea how to fix this? Your help is much appreciated.

Daning
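
One thing worth checking, as a guess: a flat 30-second cutoff is often the
client's Thrift socket timeout rather than the server's rpc_timeout_in_ms.
A minimal sketch of where that knob sits on a raw Thrift connection (host
and values hypothetical):

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class ClientSocketTimeout {
    public static void main(String[] args) throws Exception {
        // TSocket's timeout bounds how long the client waits for any reply;
        // if it is 30000 ms, calls give up at 30 s no matter what
        // rpc_timeout_in_ms says on the server.
        TSocket socket = new TSocket("localhost", 9160);
        socket.setTimeout(60000); // hypothetical: longer than the server timeout
        TFramedTransport transport = new TFramedTransport(socket);
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        System.out.println(client.describe_version());
        transport.close();
    }
}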


Re: Weird problem with empty CF

2011-10-03 Thread Daning Wang
Lots of SliceQueryFilter messages in the log; is that handling tombstones?

DEBUG [ReadStage:49] 2011-10-03 20:15:07,942 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317582939743663:true:4@1317582939933000
DEBUG [ReadStage:50] 2011-10-03 20:15:07,942 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317573253148778:true:4@1317573253354000
DEBUG [ReadStage:43] 2011-10-03 20:15:07,942 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317669552951428:true:4@1317669553018000
DEBUG [ReadStage:33] 2011-10-03 20:15:07,942 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317581886709261:true:4@1317581886957000
DEBUG [ReadStage:52] 2011-10-03 20:15:07,942 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317568165152246:true:4@1317568165482000
DEBUG [ReadStage:36] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317567265089211:true:4@1317567265405000
DEBUG [ReadStage:53] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317674324843122:true:4@1317674324946000
DEBUG [ReadStage:38] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317571990078721:true:4@1317571990141000
DEBUG [ReadStage:57] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317671855234221:true:4@1317671855239000
DEBUG [ReadStage:54] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317558305262954:true:4@1317558305337000
DEBUG [RequestResponseStage:11] 2011-10-03 20:15:07,941
ResponseVerbHandler.java (line 48) Processing response on a callback from
12347@/10.210.101.104
DEBUG [RequestResponseStage:9] 2011-10-03 20:15:07,941
AbstractRowResolver.java (line 66) Preprocessed data response
DEBUG [RequestResponseStage:13] 2011-10-03 20:15:07,941
AbstractRowResolver.java (line 66) Preprocessed digest response
DEBUG [ReadStage:58] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317581337972739:true:4@1317581338044000
DEBUG [ReadStage:64] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317582656796332:true:4@131758265697
DEBUG [ReadStage:55] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317569432886284:true:4@1317569432984000
DEBUG [ReadStage:45] 2011-10-03 20:15:07,941 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317572658687019:true:4@1317572658718000
DEBUG [ReadStage:47] 2011-10-03 20:15:07,940 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317582281617755:true:4@1317582281717000
DEBUG [ReadStage:48] 2011-10-03 20:15:07,940 SliceQueryFilter.java (line
123) collecting 0 of 1: 1317549607869226:true:4@1317549608118000
DEBUG [ReadStage:34] 2011-10-03 20:15:07,940 SliceQueryFilter.java (line
123) collecting 0 of 1:
On Thu, Sep 29, 2011 at 2:17 PM, aaron morton aa...@thelastpickle.com wrote:

 As with any situation involving the un-dead, it really is the number of
 Zombies, Mummies or Vampires that is the concern.

 If you delete data there will always be tombstones. If you have a delete
 heavy workload there will be more tombstones. This is why implementing a
 queue with cassandra is a bad idea.

 gc_grace_seconds (and column TTL) is the *minimum* amount of time the
 tombstones will stay in the data files; there is no maximum.

 Your read performance also depends on the number of SSTables the row is
 spread over, see
 http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

 If you really wanted to purge them then yes a repair and then major
 compaction would be the way to go. Also consider if it's possible to design
 the data model around the problem, e.g. partitioning rows by date. IMHO I
 would look to make data model changes before implementing a compaction
 policy, or consider if cassandra is the right store if you have a delete
 heavy workload.

 Cheers


 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 30/09/2011, at 3:27 AM, Daning Wang wrote:

 Jonathan/Aaron,

 Thank you guys for the replies. I will change GCGracePeriod to 1 day to see
 what happens.

 Is there a way to purge tombstones at any time? Because tombstones affect
 performance, we want them purged right away, not after GCGracePeriod. We
 know all the nodes are up, and we can run repair first to make sure of
 consistency before purging.

 Thanks,

 Daning


 On Wed, Sep 28, 2011 at 5:22 PM, aaron morton aa...@thelastpickle.com wrote:

 if I had to guess I would say it was spending time handling tombstones. If
 you see it happen again, and are interested, turn the logging up to DEBUG
 and look for messages from something starting with Slice

 Minor (automatic) compaction will, over time, purge the tombstones. Until
 then, reads must read and discard the data deleted by the tombstones. If you
 perform a big (i.e. 100k's) delete, this can reduce performance until
 compaction does its thing.

 My second guess would be read repair (or the simple consistency checks on
 read
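
For reference, gc_grace_seconds is a per-column-family setting; a hedged
sketch of lowering it over Thrift (keyspace and CF names are hypothetical;
authentication omitted), after which it is compaction that actually drops
the eligible tombstones:

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class LowerGcGrace {
    public static void main(String[] args) throws Exception {
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        client.set_keyspace("MyKeyspace"); // hypothetical keyspace
        for (CfDef cf : client.describe_keyspace("MyKeyspace").getCf_defs()) {
            if (cf.getName().equals("MyCF")) { // hypothetical CF
                // Tombstones become purgeable one day after deletion instead
                // of the 10-day default; only safe if repair runs within that
                // window while all nodes are up.
                cf.setGc_grace_seconds(86400);
                client.system_update_column_family(cf);
            }
        }
        transport.close();
    }
}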

Queue suggestion in Cassandra

2011-09-16 Thread Daning Wang
We are trying to implement an ordered queue system in Cassandra (ver 0.8.5).
In the initial design we use a row as the queue, with a column for each item
in the queue; that means creating a new column when inserting an item and
deleting the column when the top item is popped. Since columns are sorted in
Cassandra, we get an ordered queue.
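
A minimal Hector sketch of that design, assuming a CF with a TimeUUIDType
comparator (all names here are hypothetical and error handling is omitted):

import java.util.UUID;

import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.cassandra.serializers.UUIDSerializer;
import me.prettyprint.cassandra.utils.TimeUUIDUtils;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.beans.HColumn;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;
import me.prettyprint.hector.api.query.SliceQuery;

public class ColumnQueue {
    private static final String CF = "Queue"; // hypothetical CF, TimeUUIDType comparator
    private final Keyspace keyspace;
    private final String queueRow;

    public ColumnQueue(Keyspace keyspace, String queueRow) {
        this.keyspace = keyspace;
        this.queueRow = queueRow;
    }

    // Enqueue: one new column per item; TimeUUID names keep arrival order.
    public void push(String item) {
        Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
        UUID name = TimeUUIDUtils.getUniqueTimeUUIDinMillis();
        m.insert(queueRow, CF,
                HFactory.createColumn(name, item, UUIDSerializer.get(), StringSerializer.get()));
    }

    // Dequeue: read the first column, then delete it (leaving a tombstone,
    // which is exactly what piles up as the queue churns).
    public String pop() {
        SliceQuery<String, UUID, String> q = HFactory.createSliceQuery(
                keyspace, StringSerializer.get(), UUIDSerializer.get(), StringSerializer.get());
        q.setColumnFamily(CF);
        q.setKey(queueRow);
        q.setRange(null, null, false, 1);
        ColumnSlice<UUID, String> slice = q.execute().get();
        if (slice.getColumns().isEmpty()) {
            return null;
        }
        HColumn<UUID, String> head = slice.getColumns().get(0);
        Mutator<String> m = HFactory.createMutator(keyspace, StringSerializer.get());
        m.delete(queueRow, CF, head.getName(), UUIDSerializer.get());
        return head.getValue();
    }
}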

It works fine until the queue size reaches 50K; then we get high CPU usage
and constant GC, which makes the whole Cassandra server very slow and
unresponsive, and we have to run a full compaction to fix the problem.

Due to this performance issue, this queue is not useful for us. We are
looking for other designs, and I want to know if anybody has implemented a
large ordered queue successfully.

Let me know if you have suggestions.

Thank you in advance.

Daning


ByteOrderedPartitioner

2011-09-16 Thread Daning Wang
How is the performance of ByteOrderedPartitioner compared to
RandomPartitioner? For getting data with a single key, does it use the same
algorithm?

I have read that the downside of ByteOrderedPartitioner is that it creates
hotspots. But if I have 4 nodes and set RF to 4, that will replicate data to
all 4 nodes, which could avoid hotspots, right?

Thank you in advance,

Daning
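
On the single-key question: both partitioners locate a key by its token, and
single-key reads cost about the same either way; what differs is how the
token is derived, which is what creates (or avoids) the hotspots. A small
illustration (the key is hypothetical):

import java.math.BigInteger;
import java.security.MessageDigest;

public class PartitionerTokens {
    public static void main(String[] args) throws Exception {
        String key = "user:42"; // hypothetical key
        // RandomPartitioner: the token is (roughly) the MD5 digest of the
        // key, so adjacent keys scatter uniformly around the ring.
        byte[] digest = MessageDigest.getInstance("MD5").digest(key.getBytes("UTF-8"));
        System.out.println("RP token:  " + new BigInteger(1, digest));
        // ByteOrderedPartitioner: the token is the raw key bytes, so
        // lexicographically close keys land on the same replicas.
        System.out.println("BOP token: " + new BigInteger(1, key.getBytes("UTF-8")).toString(16));
    }
}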