from:"Wei Zhu"

Re: one big cluster vs multiple smaller clusters

2013-10-14 Thread Wei Zhu

Hi guys,
Thanks for your reply. It's very helpful.

I agree with Plotnik on the scaling part. 

For the business logic, it sounds obvious that it makes sense to divide, e.g., 
metadata and really BIG data into different clusters, as you mentioned. But 
after thinking about it a bit more, what is the real reason for that if the 
cluster can be scaled horizontally? Each node still has the same amount of 
data, so what is the benefit of having a separate cluster?
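
A rough back-of-the-envelope check of the "same amount of data per node" point, as a sketch only. The replication factor of 3 and the even split of data across clusters are assumptions for illustration, not figures from this thread; `per_node_share` is a hypothetical helper.

```python
from fractions import Fraction


def per_node_share(total_nodes, clusters, replication_factor):
    """Fraction of the whole data set stored on each node.

    Assumes (hypothetically) that data is split evenly across the clusters
    and that every cluster uses the same replication factor.
    """
    nodes_per_cluster = Fraction(total_nodes, clusters)
    data_per_cluster = Fraction(1, clusters)
    # Each cluster stores replication_factor copies of its slice of the
    # data, spread evenly over its own nodes.
    return data_per_cluster * replication_factor / nodes_per_cluster


one_big = per_node_share(total_nodes=15, clusters=1, replication_factor=3)
three_small = per_node_share(total_nodes=15, clusters=3, replication_factor=3)
print(one_big, three_small)
```

Under these assumptions both layouts put the same share of the total data on every node, so per-node storage alone does not favor splitting.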

For an analytics application, it has to be on a separate cluster, as Paulo 
pointed out. But if all of the use cases are for web applications (as of now), 
what are the drawbacks of putting everything into one big cluster? 


Thanks.
-Wei



On Monday, October 14, 2013 4:15 AM, Paulo Motta pauloricard...@gmail.com 
wrote:
 
By clusters do you mean data centers? If so, I think it depends on your use 
case and application requirements.

For instance, if you have a web application and an analytics application 
(hadoop), you would want to separate your cluster in 2 different data centers 
(even if they're located in the same physical zone). If it's just one 
application then you can start with one data center and add more data centers 
later if needed.



2013/10/14 Plotnik, Alexey aplot...@rhonda.ru

If you are talking about scaling: Cassandra scaling is absolutely horizontal 
without namenodes or other Mongo-bullshit-like intermediate daemons. And that’s 
why one big cluster has the same throughput as many smaller clusters.
What will you do when your small clusters exceed their capacity? Cassandra 
is designed for very large data so feel free to utilize its capabilities.
 
If you are talking in terms of business logic: it makes sense to divide, e.g., 
metadata and really BIG data into different clusters, of course.
 
From:Wz1975 [mailto:wz1...@yahoo.com] 
Sent: October 14, 2013 7:20
To: user@cassandra.apache.org

Subject: Re: one big cluster vs multiple smaller clusters
Importance: Low
 
We have the choice of making one big cluster vs. a few small clusters. I am 
trying to get the pros and cons of both options in general. 


Thanks.
-Wei

Sent from my Samsung smartphone on ATT 


 Original message 
Subject: Re: one big cluster vs multiple smaller clusters 
From: Jon Haddad j...@jonhaddad.com 
To: user@cassandra.apache.org 
CC: 


This is a pretty vague question.  What are you trying to achieve?
 
On Oct 12, 2013, at 9:05 PM, Wei Zhu wz1...@yahoo.com wrote:



Hi,
As we bring more use cases to Cassandra, we have been thinking about the best 
way to host it. Let's say we will have 15 physical machines available, we can 
use all of them to form a big cluster or divide them into 3 clusters with 5 
nodes each. As we will deploy to 1.2, it becomes easier to expand the cluster 
with vnodes. I really don't see any good reasons to make 3 smaller clusters. 
Did I miss anything obvious?

Thanks.
-Wei
 


-- 

Paulo Ricardo

-- 
European Master in Distributed Computing
Royal Institute of Technology - KTH

Instituto Superior Técnico - IST
http://paulormg.com

one big cluster vs multiple smaller clusters

2013-10-12 Thread Wei Zhu

Hi, 
As we bring more use cases to Cassandra, we have been thinking about the best 
way to host it. Let's say we will have 15 physical machines available, we can 
use all of them to form a big cluster or divide them into 3 clusters with 5 
nodes each. As we will deploy to 1.2, it becomes easier to expand the cluster 
with vnodes. I really don't see any good reasons to make 3 smaller clusters. 
Did I miss anything obvious? 

Thanks. 
-Wei

Re: sstable size change

2013-07-24 Thread Wei Zhu

What is the output of show keyspaces from cassandra-cli? Did you see the new value?

  Compaction Strategy: 
org.apache.cassandra.db.compaction.LeveledCompactionStrategy
      Compaction Strategy Options:
        sstable_size_in_mb: XXX



 From: Keith Wright kwri...@nanigans.com
To: user@cassandra.apache.org user@cassandra.apache.org 
Sent: Wednesday, July 24, 2013 3:44 PM
Subject: Re: sstable size change
 


Hi all,

   This morning I increased the SSTable size for one of my LCS column 
families via an alter command and saw at least one compaction run (I did not 
trigger a compaction via nodetool, nor run upgradesstables, nor remove the 
.json file).  But so far my data files appear to be the same size, the 
default 5 MB (see below for the output of ls -Sal as well as the relevant 
portion of cfstats).   Is this expected?  I was hoping to see at least one 
file at the new 256 MB size I set.

Thanks

SSTable count: 4965
SSTables in each level: [0, 10, 112/100, 1027/1000, 3816, 0, 0, 0]
Space used (live): 29062393142
Space used (total): 29140547702
Number of Keys (estimate): 195103104
Memtable Columns Count: 441483
Memtable Data Size: 205486218
Memtable Switch Count: 243
Read Count: 154226729
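
To make the "SSTables in each level" line above easier to read, here is a small sketch that parses it. Reading an entry such as `112/100` as "112 sstables against a target of 100 for that level" is an assumption about the cfstats notation; `parse_levels` and `levels_over_target` are hypothetical helper names.

```python
def parse_levels(line):
    """Parse 'SSTables in each level: [0, 10, 112/100, ...]' into (count, target) pairs.

    A bare number is returned with target None, because no target is shown.
    """
    inside = line[line.index("[") + 1:line.index("]")]
    levels = []
    for entry in inside.split(","):
        entry = entry.strip()
        if "/" in entry:
            count, target = entry.split("/")
            levels.append((int(count), int(target)))
        else:
            levels.append((int(entry), None))
    return levels


def levels_over_target(levels):
    """Return the indexes of levels whose count exceeds the displayed target."""
    return [i for i, (count, target) in enumerate(levels)
            if target is not None and count > target]


line = "SSTables in each level: [0, 10, 112/100, 1027/1000, 3816, 0, 0, 0]"
levels = parse_levels(line)
print(sum(count for count, _ in levels))
print(levels_over_target(levels))
```

The per-level counts add up to the reported SSTable count, which is a quick sanity check that the line was parsed as intended.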

-rw-rw-r--  1 cassandra cassandra 5247564 Jul 18 01:33 
users-shard_user_lookup-ib-97153-Data.db
-rw-rw-r--  1 cassandra cassandra 5247454 Jul 23 02:59 
users-shard_user_lookup-ib-109063-Data.db
-rw-rw-r--  1 cassandra cassandra 5247421 Jul 20 14:58 
users-shard_user_lookup-ib-103127-Data.db
-rw-rw-r--  1 cassandra cassandra 5247415 Jul 17 13:56 
users-shard_user_lookup-ib-95761-Data.db
-rw-rw-r--  1 cassandra cassandra 5247379 Jul 21 02:44 
users-shard_user_lookup-ib-104718-Data.db
-rw-rw-r--  1 cassandra cassandra 5247346 Jul 21 21:54 
users-shard_user_lookup-ib-106280-Data.db
-rw-rw-r--  1 cassandra cassandra 5247242 Jul  3 19:41 
users-shard_user_lookup-ib-66049-Data.db
-rw-rw-r--  1 cassandra cassandra 5247235 Jul 21 02:44 
users-shard_user_lookup-ib-104737-Data.db
-rw-rw-r--  1 cassandra cassandra 5247233 Jul 20 14:58 
users-shard_user_lookup-ib-103169-Data.db

From:  sankalp kohli kohlisank...@gmail.com
Reply-To:  user@cassandra.apache.org user@cassandra.apache.org
Date:  Tuesday, July 23, 2013 3:04 PM
To:  user@cassandra.apache.org user@cassandra.apache.org
Subject:  Re: sstable size change


 Will Cassandra force any newly compacted files to my new setting as 
 compactions are naturally triggered?
Yes. Let it compact and increase in size. 



On Tue, Jul 23, 2013 at 9:38 AM, Robert Coli rc...@eventbrite.com wrote:

On Tue, Jul 23, 2013 at 6:48 AM, Keith Wright kwri...@nanigans.com wrote:

Can you elaborate on what you mean by let it take its own course 
organically?  Will Cassandra force any newly compacted files to my new 
setting as compactions are naturally triggered?


You see, when two (or more!) SSTables love each other very much, they 
sometimes decide they want to compact together..


But seriously, yes. If you force all existing SSTables to level 0, it is as 
if you just flushed them all. Level compaction then does a whole lot of 
compaction, using the active table size.


=Rob

Re: AssertionError: Unknown keyspace?

2013-06-24 Thread Wei Zhu

I got bitten by this once. At least there should be a message saying that no 
data is streamed because the node is a seed node. 
I searched the source code: the message was there, and it got removed in a 
certain version.

-Wei 




 From: Robert Coli rc...@eventbrite.com
To: user@cassandra.apache.org 
Sent: Monday, June 24, 2013 10:34 AM
Subject: Re: AssertionError: Unknown keyspace?
 

On Mon, Jun 24, 2013 at 6:04 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
 Oh shoot, this is a seed node.  Is there documentation on how to bootstrap
 a seed node?  If I have seeds of A, B, C for every machine on the ring and
 I am bootstrapping node B, do I just modify cassandra.yaml and remove node
 B from the yaml file temporarily and boot it up

Yes. The only thing that makes a node fail that check is being in its
own seed list. But if the node is in other nodes' seed lists, those
nodes will contact it anyway. This strongly implies that the
contains() check there is the wrong test, but I've never nailed that
down and/or filed a ticket on it. Conversation at the summit suggests
I should, making a note to do so...

=Rob

Re: AssertionError: Unknown keyspace?

2013-06-24 Thread Wei Zhu

Here is the line in the source code for 1.1.0: 

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/service/StorageService.java#L549
 

It was later refactored to this, and the message was removed. 

https://github.com/apache/cassandra/blob/cassandra-1.2.0/src/java/org/apache/cassandra/service/StorageService.java#L549
 

-Wei 

- Original Message -

From: Dean Hiller dean.hil...@nrel.gov 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Sent: Monday, June 24, 2013 12:04:10 PM 
Subject: Re: AssertionError: Unknown keyspace? 

Yes, it would be nice at startup just to say don't list your seed node as this 
node and then fail out and we would have known this a long long time ago ;). 
Dean 

From: Wei Zhu wz1...@yahoo.com 
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Date: Monday, June 24, 2013 12:36 PM 
To: user@cassandra.apache.org 
Subject: Re: AssertionError: Unknown keyspace? 

I got bitten by this once. At least there should be a message saying that no 
data is streamed because the node is a seed node. 
I searched the source code: the message was there, and it got removed in a 
certain version. 

-Wei 

 
From: Robert Coli rc...@eventbrite.com 
To: user@cassandra.apache.org 
Sent: Monday, June 24, 2013 10:34 AM 
Subject: Re: AssertionError: Unknown keyspace? 

On Mon, Jun 24, 2013 at 6:04 AM, Hiller, Dean dean.hil...@nrel.gov wrote: 
 Oh shoot, this is a seed node. Is there documentation on how to bootstrap 
 a seed node? If I have seeds of A, B, C for every machine on the ring and 
 I am bootstrapping node B, do I just modify cassandra.yaml and remove node 
 B from the yaml file temporarily and boot it up 

Yes. The only thing that makes a node fail that check is being in its 
own seed list. But if the node is in other nodes' seed lists, those 
nodes will contact it anyway. This strongly implies that the 
contains() check there is the wrong test, but I've never nailed that 
down and/or filed a ticket on it. Conversation at the summit suggests 
I should, making a note to do so... 

=Rob

Re: Updated sstable size for LCS, ran upgradesstables, file sizes didn't change

2013-06-21 Thread Wei Zhu

I think new SSTables will be written at the new size. For that to happen, you 
need to trigger a compaction so that new SSTables are generated. For LCS there 
is no major compaction, though. You can run a nodetool repair and hopefully it 
will bring in some new SSTables and compactions will kick in. 
Or you can change the $CFName.json file under your data directory and move 
every SSTable to level 0. You need to stop your node, run a simple script to 
alter that file, and start the node again. 
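
A minimal sketch of what such a script could do, working on an in-memory copy of the manifest. The JSON layout used here (a top-level "generations" list whose entries have "generation" and "members" keys) is an assumption and should be checked against a real $CFName.json before relying on it; `move_all_to_level_zero` is a hypothetical helper, and the sample manifest is made up.

```python
import json


def move_all_to_level_zero(manifest):
    """Return a new manifest with every sstable generation listed under level 0.

    Assumes a layout of {"generations": [{"generation": N, "members": [...]}]}.
    """
    all_members = sorted(
        member
        for level in manifest["generations"]
        for member in level["members"]
    )
    levels = [
        {"generation": level["generation"], "members": []}
        for level in manifest["generations"]
    ]
    # Make sure a level 0 entry exists, then put every member into it.
    if not any(level["generation"] == 0 for level in levels):
        levels.insert(0, {"generation": 0, "members": []})
    for level in levels:
        if level["generation"] == 0:
            level["members"] = all_members
    return {"generations": levels}


# Made-up sample manifest for illustration only.
sample = json.loads(
    '{"generations": ['
    '{"generation": 0, "members": []},'
    '{"generation": 1, "members": [12, 7]},'
    '{"generation": 2, "members": [3]}]}'
)
rewritten = move_all_to_level_zero(sample)
print(json.dumps(rewritten))
```

Keeping a backup of the original file and running this only while the node is stopped seems prudent, since the node reads the file at startup.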

I think it will be helpful to have a nodetool command to change the SSTable 
Size and trigger the rebuild of the SSTables. 

Thanks. 
-Wei 

- Original Message -

From: Robert Coli rc...@eventbrite.com 
To: user@cassandra.apache.org 
Sent: Friday, June 21, 2013 4:51:29 PM 
Subject: Re: Updated sstable size for LCS, ran upgradesstables, file sizes 
didn't change 

On Fri, Jun 21, 2013 at 4:40 PM, Andrew Bialecki 
andrew.biale...@gmail.com wrote: 
 However when we alter the column 
 family and then run nodetool upgradesstables -a keyspace columnfamily, the 
 files in the data directory have been re-written, but the file sizes are the 
 same. 
 
 Is this the expected behavior? If not, what's the right way to upgrade them. 
 If this is expected, how can we benchmark the read/write performance with 
 varying sstable sizes. 

It is expected, upgradesstables/scrub/clean compactions work on a 
single sstable at a time, they are not capable of combining or 
splitting them. 

In theory you could probably : 

1) start out with the largest size you want to test 
2) stop your node 
3) use sstable_split [1] to split sstables 
4) start node, test 
5) repeat 2-4 

I am not sure if there is anything about level compaction which makes 
this infeasible. 

=Rob 
[1] https://github.com/pcmanus/cassandra/tree/sstable_split

Re: Data not fully replicated with 2 nodes and replication factor 2

2013-06-20 Thread Wei Zhu

I don't think you can fully trust hinted handoff; it is best effort, with no 
guarantee of delivery. Even if the hints were guaranteed to be delivered, there 
would still be a delay, which is part of the eventual consistency paradigm. If 
you want to enforce real consistency, change your consistency level. Or do a 
repair. 

Thanks. 
-Wei 

- Original Message -

From: James Lee james@metaswitch.com 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com, 
rc...@eventbrite.com 
Sent: Thursday, June 20, 2013 3:21:30 AM 
Subject: RE: Data not fully replicated with 2 nodes and replication factor 2 

Rob, Wei, thank you both for your responses - from what Rob says below my test 
is a valid one. 

I've run some additional tests and observed the following: 
-- I mentioned before that some of the initial writes failed at first and then 
succeeded when the test tool retried them. I've checked that there's no 
correlation between the keys for writes which required a retry and the keys for 
the failed reads (i.e. the reads are failing for keys that were written fine at 
the first attempt). 
-- I've rerun this test with the rate of initial writes limited to a much 
lower value (from 8000/s down to 2000/s). This makes the problem go away 
completely: no more read failures. 
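
For readers who want to reproduce the rate-limited run, client-side pacing could look roughly like the sketch below. It is not the test tool used in this thread; `paced` is a hypothetical helper, and the `clock` and `sleep` parameters exist only so the pacing can be tested without real waiting.

```python
import time


def paced(items, rate_per_second, clock=time.monotonic, sleep=time.sleep):
    """Yield items no faster than rate_per_second."""
    interval = 1.0 / rate_per_second
    next_slot = clock()
    for item in items:
        now = clock()
        if now < next_slot:
            # Too early for the next item: wait until its slot opens.
            sleep(next_slot - now)
        yield item
        next_slot = max(next_slot, now) + interval


# Usage: issue at most 2000 (hypothetical) writes per second.
written = []
for key in paced(range(5), rate_per_second=2000):
    written.append(key)  # a real client would perform the write here
print(written)
```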

So it seems like I have exposed a genuine bug in Cassandra replication which 
manifests under high write load. What's the best next step - should I be filing 
a bug report, and if so what diagnostics are likely to be useful? 

Thanks, 
James Lee 


-Original Message- 
From: Robert Coli [mailto:rc...@eventbrite.com] 
Sent: 19 June 2013 20:59 
To: user@cassandra.apache.org; Wei Zhu 
Subject: Re: Data not fully replicated with 2 nodes and replication factor 2 

On Wed, Jun 19, 2013 at 11:43 AM, Wei Zhu wz1...@yahoo.com wrote: 
 I think hints are only stored when the other node is down, not on the 
 dropped mutations. (Correct me if I am wrong, actually it's not a bad 
 idea to store hints for dropped mutations and replay them later?) 

This used to be the way it worked pre-1.0... 

https://issues.apache.org/jira/browse/CASSANDRA-2034 

In modern cassandra, anything but a successful ack from a coordinated write 
results in a hint on the coordinator. 

 To solve your issue, as I mentioned, either do nodetool repair, or 
 increase your consistency level. By the way, you probably write 
 faster than your cluster can handle if you see that many dropped mutations. 

If his hints are ultimately delivered, OP should not need repair to be 
consistent. 

=Rob

Re: Heap is not released and streaming hangs at 0%

2013-06-19 Thread Wei Zhu

If you want, you can try to force a GC through JConsole: Memory -> Perform GC. 

In theory it triggers a full GC, but when that happens depends on the JVM. 

-Wei 

- Original Message -

From: Robert Coli rc...@eventbrite.com 
To: user@cassandra.apache.org 
Sent: Tuesday, June 18, 2013 10:43:13 AM 
Subject: Re: Heap is not released and streaming hangs at 0% 

On Tue, Jun 18, 2013 at 10:33 AM, srmore comom...@gmail.com wrote: 
 But then shouldn't JVM GC it eventually ? I can still see Cassandra alive 
 and kicking but looks like the heap is locked up even after the traffic is 
 long stopped. 

No, when GC system fails this hard it is often a permanent failure 
which requires a restart of the JVM. 

 nodetool -h localhost flush didn't do much good. 

This adds support to the idea that your heap is too full, and not full 
of memtables. 

You could try nodetool -h localhost invalidatekeycache, but that 
probably will not free enough memory to help you. 

=Rob

Re: Data not fully replicated with 2 nodes and replication factor 2

2013-06-19 Thread Wei Zhu

You have a lot of dropped mutations, which means those writes might not have 
gone through. Since you use CL.ONE as the write consistency, your client 
doesn't see an exception if the write fails on only one node. 
I think hints are only stored when the other node is down, not on the dropped 
mutations. (Correct me if I am wrong, actually it's not a bad idea to store 
hints for dropped mutations and replay them later?) 

To solve your issue, as I mentioned, either do nodetool repair, or increase 
your consistency level. By the way, you probably write faster than your cluster 
can handle if you see that many dropped mutations. 

-Wei 

- Original Message -

From: James Lee james@metaswitch.com 
To: user@cassandra.apache.org 
Sent: Wednesday, June 19, 2013 2:22:39 AM 
Subject: RE: Data not fully replicated with 2 nodes and replication factor 2 

The test tool I am using catches any exceptions on the original writes and 
resubmits the write request until it's successful (bailing out after 5 
failures). So for each key Cassandra has reported a successful write. 
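
The retry behaviour described above can be sketched as follows. This is not the actual test tool; `write_with_retry` and `flaky_write` are hypothetical stand-ins for it and for the client write call. Note that a success here only means the call returned without an exception, which at consistency level ONE does not imply that every replica has the data.

```python
def write_with_retry(write, key, value, max_failures=5):
    """Call write(key, value) until it succeeds; give up after max_failures failures.

    Returns the number of failed attempts that preceded the successful one.
    """
    failures = 0
    while True:
        try:
            write(key, value)
            return failures
        except Exception:
            failures += 1
            if failures >= max_failures:
                raise


# Simulated write that fails twice and then succeeds.
attempts = {"n": 0}
store = {}


def flaky_write(key, value):
    attempts["n"] += 1
    if attempts["n"] <= 2:
        raise IOError("simulated timeout")
    store[key] = value


print(write_with_retry(flaky_write, "k1", "v1"))
```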


Nodetool says the following - I'm guessing the pending hinted handoff is the 
interesting bit? 

comet-mvs01:/dsc-cassandra-1.2.2# ./bin/nodetool tpstats 
Pool Name Active Pending Completed Blocked All time blocked 
ReadStage 0 0 35445 0 0 
RequestResponseStage 0 0 1535171 0 0 
MutationStage 0 0 3038941 0 0 
ReadRepairStage 0 0 2695 0 0 
ReplicateOnWriteStage 0 0 0 0 0 
GossipStage 0 0 2898 0 0 
AntiEntropyStage 0 0 0 0 0 
MigrationStage 0 0 245 0 0 
MemtablePostFlusher 0 0 1260 0 0 
FlushWriter 0 0 633 0 212 
MiscStage 0 0 0 0 0 
commitlog_archiver 0 0 0 0 0 
InternalResponseStage 0 0 0 0 0 
HintedHandoff 1 1 0 0 0 

Message type Dropped 
RANGE_SLICE 0 
READ_REPAIR 0 
BINARY 0 
READ 0 
MUTATION 60427 
_TRACE 0 
REQUEST_RESPONSE 0 
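
When checking several nodes, it can help to pull the dropped-message counts out of the tpstats text automatically. The sketch below parses output shaped like the listing above; `parse_dropped` is a hypothetical helper, and it relies only on the "Message type Dropped" header seen in this output.

```python
def parse_dropped(tpstats_text):
    """Return {message_type: dropped_count} from nodetool tpstats text."""
    dropped = {}
    in_section = False
    for line in tpstats_text.splitlines():
        parts = line.split()
        if parts[:3] == ["Message", "type", "Dropped"]:
            in_section = True
            continue
        if in_section and len(parts) == 2 and parts[1].isdigit():
            dropped[parts[0]] = int(parts[1])
    return dropped


sample = """\
HintedHandoff 1 1 0 0 0

Message type Dropped
RANGE_SLICE 0
READ 0
MUTATION 60427
REQUEST_RESPONSE 0
"""
counts = parse_dropped(sample)
print({name: n for name, n in counts.items() if n > 0})
```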


Looking at the hints column family in the system keyspace, I see one row with a 
large number of columns. Presumably that along with the nodetool output above 
suggests there are hinted handoffs pending? How long should I expect these to 
remain for? 

Ah, actually now that I re-run the command it seems that nodetool now reports 
that hint as completed and there are no hints left in the system keyspace on 
either node. I'm still seeing failures to read the data I'm expecting though, 
as before. Note that I've run this with a smaller data set (2M rows, 1GB data 
total) for this latest test. 

Thanks, 
James 


-Original Message- 
From: Robert Coli [mailto:rc...@eventbrite.com] 
Sent: 18 June 2013 19:45 
To: user@cassandra.apache.org 
Subject: Re: Data not fully replicated with 2 nodes and replication factor 2 

On Tue, Jun 18, 2013 at 11:36 AM, Wei Zhu wz1...@yahoo.com wrote: 
 Cassandra doesn't do async replication like HBase does. You can run 
 nodetool repair to ensure the consistency. 

While this answer is true, it is somewhat non-responsive to the OP. 

If the OP didn't see a timeout exception, the theoretical worst case is that he 
should have hints stored for the writes that initially failed to replicate. His 
nodes should not be failing GC with a total data size of 5gb on an 8gb heap, so 
those hints should deliver quite quickly. After 30 minutes those hints should 
certainly be delivered. 

@OP : do you see hints being stored? does nodetool tpstats indicate dropped 
messages? 

=Rob

Re: Data not fully replicated with 2 nodes and replication factor 2

2013-06-19 Thread Wei Zhu

Rob, 
Thanks. 
I was not aware of that. So we can avoid repair if there is no hardware 
failure...I found a blog: 

http://www.datastax.com/dev/blog/modern-hinted-handoff 

-Wei 

- Original Message -

From: Robert Coli rc...@eventbrite.com 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Sent: Wednesday, June 19, 2013 12:58:45 PM 
Subject: Re: Data not fully replicated with 2 nodes and replication factor 2 

On Wed, Jun 19, 2013 at 11:43 AM, Wei Zhu wz1...@yahoo.com wrote: 
 I think hints are only stored when the other node is down, not on the 
 dropped mutations. (Correct me if I am wrong, actually it's not a bad idea 
 to store hints for dropped mutations and replay them later?) 

This used to be the way it worked pre-1.0... 

https://issues.apache.org/jira/browse/CASSANDRA-2034 

In modern cassandra, anything but a successful ack from a coordinated 
write results in a hint on the coordinator. 

 To solve your issue, as I mentioned, either do nodetool repair, or increase 
 your consistency level. By the way, you probably write faster than your 
 cluster can handle if you see that many dropped mutations. 

If his hints are ultimately delivered, OP should not need repair to 
be consistent. 

=Rob

Re: Data not fully replicated with 2 nodes and replication factor 2

2013-06-18 Thread Wei Zhu

Cassandra doesn't do async replication like HBase does. You can run nodetool 
repair to ensure consistency. 

Or you can increase your read or write consistency level. As long as 
R + W > RF, you have strong consistency. In your case, you can use CL.TWO for 
either reads or writes. 
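
The R + W > RF rule can be checked with a few lines of arithmetic. In this sketch R and W are the number of replicas that must answer a read and a write, and RF is the replication factor; `is_strongly_consistent` is a hypothetical helper name.

```python
def is_strongly_consistent(read_replicas, write_replicas, replication_factor):
    """True when every read set must overlap every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor


RF = 2
print(is_strongly_consistent(1, 1, RF))  # read ONE, write ONE
print(is_strongly_consistent(1, 2, RF))  # read ONE, write TWO
print(is_strongly_consistent(2, 1, RF))  # read TWO, write ONE
```

With RF = 2, reading and writing at ONE does not satisfy the rule, while raising either side to TWO does.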

-Wei 

- Original Message -

From: James Lee james@metaswitch.com 
To: user@cassandra.apache.org 
Sent: Tuesday, June 18, 2013 5:02:53 AM 
Subject: Data not fully replicated with 2 nodes and replication factor 2 



Hello, 

I’m seeing a strange problem with a 2-node Cassandra test deployment, where it 
seems that data isn’t being replicated among the nodes as I would expect. I 
suspect this may be a configuration issue of some kind, but have been unable to 
figure what I should change. 

The setup is as follows: 
· Two Cassandra nodes in the cluster (they each have themselves and the other 
node as seeds in cassandra.yaml). 
· Create 40 keyspaces, each with simple replication strategy and replication 
factor 2. 
· Populate 125,000 rows into each keyspace, using a pycassa client with a 
connection pool pointed at both nodes (I’ve verified that pycassa does indeed 
send roughly half the writes to each node). These are populated with writes 
using consistency level of 1. 
· Wait 30 minutes (to give replications a chance to complete). 
· Do random reads of the rows in the keyspaces, again using a pycassa client 
with a connection pool pointed at both nodes. These are read using consistency 
level 1. 

I’m finding that the vast majority of reads are successful, but a small 
proportion (~0.1%) are returned as Not Found. If I manually try to look up 
those keys using cassandra-cli, I see that they are returned when querying one 
of the nodes, but not when querying the other. So it seems like some of the 
rows have simply not been replicated. 

I’m not sure how I can monitor the status of ongoing replications, but the 
system has been idle for many 10s of minutes and the total database size is 
only about 5GB, so I don’t think there are any further ongoing operations. 

Any suggestions? In case it’s relevant, my setup is: 
· Cassandra 1.2.2, running on Linux 
· Sun Java 1.7.0_10-b18 64-bit 
· Java heap settings: -Xms8192M -Xmx8192M -Xmn2048M 

Thank you, 
James Lee

Re: Large number of files for Leveled Compaction

2013-06-16 Thread Wei Zhu

The default value of 5MB is way too small in practice. Too many files in one 
directory is not a good thing. It's not clear what a good number would be. I 
have heard of people using 50MB, 75MB, even 100MB. Do your own tests to find 
the right number. 
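
To get a feel for how the setting affects the number of sstables, here is a rough estimate. The 100 GiB data size is a made-up figure for illustration, and the calculation assumes every sstable is exactly sstable_size_in_mb, which real sstables only approximate; `estimated_sstable_count` is a hypothetical helper.

```python
import math


def estimated_sstable_count(data_bytes, sstable_size_in_mb):
    """Approximate number of sstables needed to hold data_bytes."""
    return math.ceil(data_bytes / (sstable_size_in_mb * 1024 * 1024))


data_bytes = 100 * 1024 ** 3  # hypothetical 100 GiB on one node
for size_mb in (5, 50, 75, 100):
    print(size_mb, estimated_sstable_count(data_bytes, size_mb))
```

Each sstable consists of several files on disk, so the file count in the directory is a multiple of these numbers.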

-Wei 

- Original Message -

From: Franc Carter franc.car...@sirca.org.au 
To: user@cassandra.apache.org 
Sent: Sunday, June 16, 2013 10:15:22 PM 
Subject: Re: Large number of files for Leveled Compaction 




On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali  mainalima...@gmail.com  
wrote: 



Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller 
sstables into large ones. With the LeveledCompaction, the sstables are always 
of fixed size but they are grouped into different levels. 


You can refer to this page 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on 
details of how LeveledCompaction works. 





Yes, but it seems I've misinterpreted that page ;-( 

I took this paragraph 


In figure 3, new sstables are added to the first level, L0, and immediately 
compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are 
promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted 
with the sstables in L2 with which they overlap. As more data is added, leveled 
compaction results in a situation like the one shown in figure 4. 

to mean that once a level fills up it gets compacted into a higher level 

cheers 




Cheers 
Manoj 





On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter  franc.car...@sirca.org.au  
wrote: 


On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali  mainalima...@gmail.com  
wrote: 



With LeveledCompaction, each sstable has a fixed size, defined by 
sstable_size_in_mb in the compaction configuration of the CF definition; the 
default value is 5MB. In your case, you may not have defined your own value, 
which is why each of your sstables is 5MB. And if your dataset is huge, you 
will see a lot of sstables. 



Ok, seems like I do have (at least) an incomplete understanding. I realise that 
the minimum size is 5MB, but I thought compaction would merge these into a 
smaller number of larger sstables ? 

thanks 







Cheers 


Manoj 




On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter  franc.car...@sirca.org.au  
wrote: 




Hi, 

We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it 
may be a win for us. 

The first step of testing was to push a fairly large slab of data into the 
Column Family - we did this much faster ( x100) than we would in a production 
environment. This has left the Column Family with about 140,000 files in the 
Column Family directory which seems way too high. On two of the nodes the 
CompactionStats show 2 outstanding tasks and on a third node there are over 
13,000 outstanding tasks. However from looking at the log activity it looks 
like compaction has finished on all nodes. 

Is this number of files expected/normal ? 

cheers 

-- 

Franc Carter | Systems architect | Sirca Ltd 

franc.car...@sirca.org.au | www.sirca.org.au 
Tel: +61 2 8355 2514 

Level 4, 55 Harrington St, The Rocks NSW 2000 
PO Box H58, Australia Square, Sydney NSW 1215 









-- 

Franc Carter | Systems architect | Sirca Ltd 

franc.car...@sirca.org.au | www.sirca.org.au 
Tel: +61 2 8355 2514 

Level 4, 55 Harrington St, The Rocks NSW 2000 
PO Box H58, Australia Square, Sydney NSW 1215 







-- 

Franc Carter | Systems architect | Sirca Ltd 

franc.car...@sirca.org.au | www.sirca.org.au 
Tel: +61 2 8355 2514 

Level 4, 55 Harrington St, The Rocks NSW 2000 
PO Box H58, Australia Square, Sydney NSW 1215

Re: Large number of files for Leveled Compaction

2013-06-16 Thread Wei Zhu

Correction, the largest I heard is 256MB SSTable size. 

- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Sunday, June 16, 2013 10:28:25 PM 
Subject: Re: Large number of files for Leveled Compaction 

The default value of 5MB is way too small in practice. Too many files in one 
directory is not a good thing. It's not clear what a good number would be. I 
have heard of people using 50MB, 75MB, even 100MB. Do your own tests to find 
the right number. 

-Wei 

- Original Message -

From: Franc Carter franc.car...@sirca.org.au 
To: user@cassandra.apache.org 
Sent: Sunday, June 16, 2013 10:15:22 PM 
Subject: Re: Large number of files for Leveled Compaction 

On Mon, Jun 17, 2013 at 2:59 PM, Manoj Mainali  mainalima...@gmail.com  
wrote: 

Not in the case of LeveledCompaction. Only SizeTieredCompaction merges smaller 
sstables into large ones. With the LeveledCompaction, the sstables are always 
of fixed size but they are grouped into different levels. 

You can refer to this page 
http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra on 
details of how LeveledCompaction works. 

Yes, but it seems I've misinterpreted that page ;-( 

I took this paragraph 

In figure 3, new sstables are added to the first level, L0, and immediately 
compacted with the sstables in L1 (blue). When L1 fills up, extra sstables are 
promoted to L2 (violet). Subsequent sstables generated in L1 will be compacted 
with the sstables in L2 with which they overlap. As more data is added, leveled 
compaction results in a situation like the one shown in figure 4. 

to mean that once a level fills up it gets compacted into a higher level 

cheers 


Cheers 
Manoj 

On Mon, Jun 17, 2013 at 1:54 PM, Franc Carter  franc.car...@sirca.org.au  
wrote: 


On Mon, Jun 17, 2013 at 2:47 PM, Manoj Mainali  mainalima...@gmail.com  
wrote: 

With LeveledCompaction, each sstable has a fixed size, defined by 
sstable_size_in_mb in the compaction configuration of the CF definition; the 
default value is 5MB. In your case, you may not have defined your own value, 
which is why each of your sstables is 5MB. And if your dataset is huge, you 
will see a lot of sstables. 

Ok, seems like I do have (at least) an incomplete understanding. I realise that 
the minimum size is 5MB, but I thought compaction would merge these into a 
smaller number of larger sstables ? 

thanks 


Cheers 

Manoj 

On Fri, Jun 7, 2013 at 1:44 PM, Franc Carter  franc.car...@sirca.org.au  
wrote: 


Hi, 

We are trialling Cassandra-1.2(.4) with Leveled compaction as it looks like it 
may be a win for us. 

The first step of testing was to push a fairly large slab of data into the 
Column Family - we did this much faster ( x100) than we would in a production 
environment. This has left the Column Family with about 140,000 files in the 
Column Family directory which seems way too high. On two of the nodes the 
CompactionStats show 2 outstanding tasks and on a third node there are over 
13,000 outstanding tasks. However from looking at the log activity it looks 
like compaction has finished on all nodes. 

Is this number of files expected/normal ? 

cheers 

-- 

Franc Carter | Systems architect | Sirca Ltd 

franc.car...@sirca.org.au | www.sirca.org.au 
Tel: +61 2 8355 2514 

Level 4, 55 Harrington St, The Rocks NSW 2000 
PO Box H58, Australia Square, Sydney NSW 1215 



-- 

Franc Carter | Systems architect | Sirca Ltd 

franc.car...@sirca.org.au | www.sirca.org.au 
Tel: +61 2 8355 2514 

Level 4, 55 Harrington St, The Rocks NSW 2000 
PO Box H58, Australia Square, Sydney NSW 1215 



-- 

Franc Carter | Systems architect | Sirca Ltd 

franc.car...@sirca.org.au | www.sirca.org.au 
Tel: +61 2 8355 2514 

Level 4, 55 Harrington St, The Rocks NSW 2000 
PO Box H58, Australia Square, Sydney NSW 1215

Re: High performance disk io

2013-05-22 Thread Wei Zhu

For us, the biggest killer is repair and compaction following repair. If you 
are running VNodes, you need to test the performance while running repair. 

- Original Message -

From: Igor i...@4friends.od.ua 
To: user@cassandra.apache.org 
Sent: Wednesday, May 22, 2013 7:48:34 AM 
Subject: Re: High performance disk io 

On 05/22/2013 05:41 PM, Christopher Wirt wrote: 

Hi Igor, 

Yea same here, 15ms for the 99th percentile is our max. Currently getting one 
or two ms for most CFs. It goes up at peak times, which is what we want to avoid. 

Our 99th percentile also goes up at peak times but stays at an acceptable level. 

We’re using Cass 1.2.4 w/vnodes and our own barebones driver on top of thrift. 
Needed to be .NET so Hector and Astyanax were not options. 

Astyanax is token-aware, so we avoid extra data hops between cassandra nodes. 

Do you use SSDs or multiple SSDs in any kind of configuration or RAID? 

No, single SSD per host 

Thanks 

Chris 

From: Igor [ mailto:i...@4friends.od.ua ] 
Sent: 22 May 2013 15:07 
To: user@cassandra.apache.org 
Subject: Re: High performance disk io 

Hello 

What level of read performance do you expect? We have a limit of 15 ms for the 
99th percentile, with average read latency near 0.9 ms. For some CFs the 99th 
percentile is actually 2 ms, for others 10 ms; this depends on the data volume 
you read in each query. 

Tuning read performance involved cleaning up the data model, tuning 
cassandra.yaml, switching from Hector to Astyanax, and tuning OS parameters. 

On 05/22/2013 04:40 PM, Christopher Wirt wrote: 

Hello, 

We’re looking at deploying a new ring where we want the best possible read 
performance. 

We’ve set up a cluster with 6 nodes, replication factor 3, 32GB of memory, 8GB 
heap, 800MB key cache, each holding 40/50GB of data on a 200GB SSD, and 500GB 
SATA for OS and commitlog 
Three column families 
ColFamily1 50% of the load and data 
ColFamily2 35% of the load and data 
ColFamily3 15% of the load and data 

At the moment we are still seeing around 20% disk utilisation and occasionally 
as high as 40/50% on some nodes at peak time.. we are conducting some semi live 
testing. 
CPU looks fine, memory is fine, keycache hit rate is about 80% (could be 
better, so maybe we should be increasing the keycache size?) 

Anyway, we’re looking into what we can do to improve this. 

One conversation we are having at the moment is around the SSD disk setup. 

We are considering moving to have 3 smaller SSD drives and spreading the data 
across those. 

The possibilities are: 
-We have a RAID0 of the smaller SSDs and hope that improves performance. 
Will this actually yield better throughput? 

-We mount the SSDs to different directories and define multiple data 
directories in cassandra.yaml. 
Will not having a RAID controller layer improve the throughput? 

-We mount the SSDs to different column family directories and have a single 
data directory declared in cassandra.yaml. 
We think this is quite an attractive idea. 
What are the drawbacks? Will system column families be on the main SATA? 

-We don’t change anything and just keep upping our keycache. 
-Anything you guys can think of. 

Ideas and thoughts welcome. Thanks for your time and expertise. 

Chris 


Re: High performance disk io

2013-05-22 Thread Wei Zhu

Without VNodes, repair -pr streams data for all the replicas and repairs all 
of them, so it will impact RF number of nodes. 
In the case of VNodes, the streaming/compaction should happen on all the 
physical nodes. I heard that repair is even worse with VNodes. Test it and see 
how it goes. 

-Wei 
- Original Message -

From: Dean Hiller dean.hil...@nrel.gov 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Sent: Wednesday, May 22, 2013 12:19:44 PM 
Subject: Re: High performance disk io 

If you are only running repair on one node, should it not skip that node? So 
there should be no performance hit except when doing CL_ALL of course. We had 
to make a change to cassandra or slow nodes did impact us previously. 

Dean 

From: Wei Zhu wz1...@yahoo.com 
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Date: Wednesday, May 22, 2013 1:16 PM 
To: user@cassandra.apache.org 
Subject: Re: High performance disk io 

For us, the biggest killer is repair and compaction following repair. If you 
are running VNodes, you need to test the performance while running repair. 

 

Re: any way to get the #writes/second, reads per second

2013-05-14 Thread Wei Zhu

We have a long-running script which wakes up every minute to get the read and 
write counts through JMX. It calculates r/s and w/s and sends them to 
ganglia. We are thinking of using graphite, which comes with some sort of 
intelligence as mentioned by Tomàs, but it's just too big a change for our 
infrastructure. 
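
The rate calculation itself is small. Here is a minimal sketch in Python, 
assuming the cumulative read/write counts have already been fetched from JMX 
by some other means. The `Sample` type and the numbers are made up for 
illustration, and no JMX call is shown. 

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One reading of a cumulative counter (hypothetical structure)."""
    timestamp: float  # seconds since some fixed reference point
    count: int        # cumulative operations reported by the node


def ops_per_second(previous: Sample, current: Sample) -> float:
    """Rate between two samples of a monotonically increasing counter."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = current.count - previous.count
    if delta < 0:
        # The counter went backwards, e.g. after a node restart; treat the
        # current value as the number of operations since the restart.
        delta = current.count
    return delta / elapsed


# Two made-up readings taken one minute apart.
reads_before = Sample(timestamp=0.0, count=1_200_000)
reads_after = Sample(timestamp=60.0, count=1_263_000)
print(ops_per_second(reads_before, reads_after))  # reads per second
```

Handling the counter reset matters for a script that runs for weeks, since 
otherwise a restart shows up as a large negative rate. 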

-Wei 

- Original Message -

From: Dean Hiller dean.hil...@nrel.gov 
To: user@cassandra.apache.org 
Sent: Tuesday, May 14, 2013 4:37:14 AM 
Subject: Re: any way to get the #writes/second, reads per second 

Cool, thanks, 
Dean 

From: Tomàs Núnez tomas.nu...@groupalia.com 
Reply-To: user@cassandra.apache.org 
Date: Tuesday, May 14, 2013 4:53 AM 
To: user user@cassandra.apache.org 
Subject: Re: any way to get the #writes/second, reads per second 

Yes, there isn't any place in the JMX with reads/second or writes/second, just 
CompletedTasks. I send this info to Graphite (http://graphite.wikidot.com/) and 
use the derivative function to get reads/minute. Munin also does the trick 
(https://github.com/tcurdt/jmx2munin). But you can't do that with cassandra 
itself, you need somewhere to make the calculations (cacti is also a good 
match). 

Hope that helps. There is some more information about monitoring these kinds 
of things here: 
http://www.tomas.cat/blog/en/monitoring-cassandra-relevant-data-should-be-watched-and-how-send-it-graphite
 



2013/5/13 Hiller, Dean dean.hil...@nrel.gov 
We are running a pretty consistent load on our cluster and added a new node to 
a 6 node cluster Friday (QA worked great, but production not so much). One 
mistake that was made was starting up the new node, then disabling the firewall 
:( which allowed nodes to discover it BEFORE the node bootstrapped itself. We 
shut down the node and booted him up and he bootstrapped himself, streaming all 
the data in. 

After that though, all the nodes have really really high load numbers now. We 
are still trying to figure out what is going on. 

Is there any way to get the number of reads/second and writes/second through 
JMX or something? The only way I can see of doing this is manually 
calculating it by timing the read count and dividing by my manual stopwatch's 
start/stop times (time range). 

Also, while my load is load average: 20.31, 19.10, 19.72 , what does a normal 
iostat look like? My iostat await time is 13.66 ms which I think is kind of 
bad, but not that bad to cause a load of 20.31? 

Device:  rrqm/s  wrqm/s    r/s   w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda        0.02    0.07  11.70  1.96  1353.67  702.88    150.58      0.19  13.66   3.61   4.93
sdb        0.00    0.02   0.11  0.46    20.72   97.54    206.70      0.00   1.33   0.67   0.04

Thanks, 
Dean 



-- 
Tomàs Núñez 
IT-Sysprod 
www.groupalia.com 
Tel. + 34 93 159 31 00 
Fax. + 34 93 396 18 52 
Llull, 95-97, 2º planta, 08005 Barcelona 
Skype: tomas.nunez.groupalia 
tomas.nu...@groupalia.com

Re: (unofficial) Community Poll for Production Operators : Repair

2013-05-14 Thread Wei Zhu

1) 1.1.6 on 5 nodes, 24CPU, 72 RAM 
2) local quorum (we only have one DC though). We do delete through TTL 
3) yes 
4) once a week rolling repairs -pr using cron job 
5) it definitely has a negative impact on the performance. Our data size is 
around 100G per node, and during repair it brings in an additional 60G - 80G of 
data and creates about 7K compaction tasks (we use LCS with an SSTable size of 
10M, which was a mistake we made at the beginning). It takes more than a day 
for the compaction tasks to clear, and by then the next compaction starts. We 
had to set a client side (Hector) timeout to deal with it, and the SLA is still 
under control for now. 
But we had to halt go-live for another cluster due to the unanticipated 
doubling of the space needed during the repair. 
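
For what it's worth, the weekly rolling repair from answer 4 can be driven by 
a small script that cron invokes. This is only a sketch under assumptions: the 
host names are hypothetical, nodetool is assumed to be on the PATH of the 
machine running the script, and it is not the exact job we run. It repairs 
one node at a time with `nodetool -h <host> repair -pr`. 

```python
import subprocess
from typing import List

# Hypothetical host names; replace with the nodes of your own ring.
NODES = ["cass1", "cass2", "cass3", "cass4", "cass5"]


def repair_command(host: str) -> List[str]:
    """Build the nodetool invocation for a primary-range repair of one node."""
    return ["nodetool", "-h", host, "repair", "-pr"]


def rolling_repair(hosts: List[str], dry_run: bool = True) -> List[List[str]]:
    """Repair one node at a time, stopping at the first failure."""
    commands = []
    for host in hosts:
        command = repair_command(host)
        commands.append(command)
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so a failed repair stops the roll instead of moving on.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: only print what would be executed.
    for command in rolling_repair(NODES, dry_run=True):
        print(" ".join(command))
```

Running the nodes strictly one after another keeps at most one repair in 
flight, which is the point of doing it in a rolling fashion. 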

Per Dean's question to simulate the slow response, someone in the IRC mentioned 
a trick to start Cassandra with -f and ctrl-z and it works for our test. 

-Wei 
- Original Message -

From: Dean Hiller dean.hil...@nrel.gov 
To: user@cassandra.apache.org 
Sent: Tuesday, May 14, 2013 4:48:02 AM 
Subject: Re: (unofficial) Community Poll for Production Operators : Repair 

We had to roll out a fix in cassandra as a slow node was slowing down our 
clients of cassandra in 1.2.2 for some reason. Every time we had a slow node, 
we found out fast as performance degraded. We tested this in QA and had the 
same issue. This means a repair made that node slow, which made our clients 
slow. With this fix, which I think one of our team is going to try to get back 
into cassandra, the slow node does not affect our clients anymore. 

I am curious though: if someone else uses the tc program to simulate 
linux packet delay on a single node, does your client's response time get much 
slower? We simulated a 500ms delay on the node to simulate the slow node. It 
seems the coordinator node was incorrectly waiting for BOTH responses on 
CL_QUORUM instead of just one (as it was itself one as well) or something like 
that. (I don't know too much, as my colleague was the one that debugged this 
issue.) 

Dean 

From: Alain RODRIGUEZ arodr...@gmail.com 
Reply-To: user@cassandra.apache.org 
Date: Tuesday, May 14, 2013 1:42 AM 
To: user@cassandra.apache.org 
Subject: Re: (unofficial) Community Poll for Production Operators : Repair 

Hi Rob, 

1) 1.2.2 on 6 to 12 EC2 m1.xlarge 
2) Quorum RW . Almost no deletes (just some TTL) 
3) Yes 
4) On each node once a week (rolling repairs using crontab) 
5) The only behavior that is quite odd or unexplained to me is why a repair 
doesn't fix a counter mismatch between 2 nodes. I mean when I read my counters 
with CL.One I have inconsistency (the counter value may change any time I read 
it, depending, I guess, on what node I read from). Reading with CL.Quorum fixes 
this bug, but the data is still wrong on some nodes. About performance, it's 
quite expensive to run a repair, but doing it in a low-load period and in a 
rolling fashion works quite well and has no impact on the service. 

Hope this will help somehow. Let me know if you need more information. 

Alain 



2013/5/10 Robert Coli rc...@eventbrite.com 
Hi! 

I have been wondering how Repair is actually used by operators. If 
people operating Cassandra in production could answer the following 
questions, I would greatly appreciate it. 

1) What version of Cassandra do you run, on what hardware? 
2) What consistency level do you write at? Do you do DELETEs? 
3) Do you run a regularly scheduled repair? 
4) If you answered yes to 3, what is the frequency of the repair? 
5) What has been your subjective experience with the performance of 
repair? (Does it work as you would expect? Does its overhead have a 
significant impact on the performance of your cluster?) 

Thanks! 

=Rob

Re: compaction throughput rate not even close to 16MB

2013-04-24 Thread Wei Zhu

Same here. We disabled the throttling, and our disk and CPU usage are both low 
( 10%), yet it still takes hours for LCS compaction to finish after a repair. 
For this cluster, we don't delete any data, so we can rule out tombstones. Not 
sure what is holding compaction back. My observation is that for LCS 
compactions which involve a large number of SSTables (since we set the SSTable 
size too small at 10M, sometimes one compaction involves up to 10 G of data = 
1000 SSTables), the throughput is lower. So my theory is that opening and 
closing file handles has a substantial impact on the throughput. 
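
To put numbers on the "10 G of data = 1000 SSTables" figure, here is a quick 
sketch of how the SSTable count of one compaction shrinks as the configured 
SSTable size grows. It uses decimal units, as in the figure above, and the 
50M and 128M sizes are just example values. 

```python
MB = 1000 ** 2
GB = 1000 ** 3


def sstables_for(data_bytes: int, sstable_bytes: int) -> int:
    """Number of SSTables needed to hold data_bytes, rounded up."""
    return (data_bytes + sstable_bytes - 1) // sstable_bytes


for size_mb in (10, 50, 128):
    count = sstables_for(10 * GB, size_mb * MB)
    print(f"{size_mb}M SSTables -> {count} SSTables for 10G of data")
```

Each SSTable is also several files on disk, so the number of files that have 
to be opened and closed is a multiple of these counts. 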

By the way, we are on SSD.

-Wei


 From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org user@cassandra.apache.org 
Sent: Wednesday, April 24, 2013 1:37 PM
Subject: Re: compaction throughput rate not even close to 16MB
 

Thanks much!!!  Better to hear at least one other person sees the same thing 
;).  Sometimes these posts just go silent.

Dean

From: Edward Capriolo edlinuxg...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, April 24, 2013 2:33 PM
To: user@cassandra.apache.org
Subject: Re: compaction throughput rate not even close to 16MB

I have noticed the same. I think in the real world your compaction throughput 
is limited by other things. If I had to speculate I would say that compaction 
can remove expired tombstones, however doing this requires bloom filter checks, 
etc.

I think that setting is more important with multi threaded compaction and/or 
more compaction slots. In those cases it may actually throttle something.


On Wed, Apr 24, 2013 at 3:54 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
I was wondering about the compaction throughput setting.  I never see ours get 
even close to 16MB and I thought this is supposed to throttle compaction, 
right?  Ours is constantly less than 3MB/sec from looking at our logs, or do I 
have this totally wrong?  How can I see the real throughput so that I can 
understand how to throttle it when I need to?

94,940,780 bytes to 95,346,024 (~100% of original) in 38,438ms = 2.365603MB/s.  
2,350,114 total rows, 2,350,022 unique.  Row merge counts were {1:2349930, 
2:92, }
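
The figure in that log line can be reproduced from the other numbers in it: 
the bytes written divided by the elapsed time, expressed in units of 
1024 * 1024 bytes per second. Below is a small Python sketch that pulls the 
numbers out of such a line and redoes the division. The regular expression is 
written against this one line only, so treat it as an assumption about the 
format rather than a general parser.

```python
import re

LOG_LINE = (
    "94,940,780 bytes to 95,346,024 (~100% of original) "
    "in 38,438ms = 2.365603MB/s."
)

# Input bytes, output bytes, elapsed milliseconds.
PATTERN = re.compile(r"([\d,]+) bytes to ([\d,]+) .* in ([\d,]+)ms")


def compaction_throughput_mb_s(line: str) -> float:
    """Recompute the throughput of one compaction summary line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a compaction summary line")
    _bytes_in, bytes_out, millis = (
        int(group.replace(",", "")) for group in match.groups()
    )
    return bytes_out / (1024 * 1024) / (millis / 1000)


print(round(compaction_throughput_mb_s(LOG_LINE), 4))
```

Summing the byte counts and durations over many such lines gives an average 
over a longer window than a single compaction.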

Thanks,
Dean

move data from Cassandra 1.1.6 to 1.2.4

2013-04-23 Thread Wei Zhu

Hi,
We are trying to upgrade from 1.1.6 to 1.2.4, but it's not really a live 
upgrade. We are going to retire the old hardware and bring in a set of new 
hardware for 1.2.4. 
For the old cluster, we have 5 nodes with RF = 3, total of 1TB data.
For the new cluster, we will have 10 nodes with RF = 3. We will use VNodes. 
What is the best way to bring the data from 1.1.6 to 1.2.4? A couple of 
concerns:
* We also use LCS and plan to increase the SSTable size. 

* We use RandomPartitioner. Should we stick with it rather than switching to 
the Murmur3 partitioner?

Thanks for your feedback.

-Wei

Re: move data from Cassandra 1.1.6 to 1.2.4

2013-04-23 Thread Wei Zhu

Hi Dean,
Our case is a bit different. We will have a set of new machines to replace 
the old ones and we want to migrate the data over. I imagine we would do 
something like:
* Let the new nodes (with VNodes) join the cluster.
* Decommission the old nodes (without VNodes).
Thanks.
-Wei



 From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org user@cassandra.apache.org; Wei Zhu 
wz1...@yahoo.com 
Sent: Tuesday, April 23, 2013 11:17 AM
Subject: Re: move data from Cassandra 1.1.6 to 1.2.4
 

We went from 1.1.4 to 1.2.2 and in QA rolling restart failed but in production 
and QA bringing down the whole cluster upgrading every node and then bringing 
it back up worked fine.  We left ours at randompartitioner and had LCS as well. 
 We did not convert to Vnodes at all.  Don't know if it helps at all, but it is 
a similar case I would think.

Dean


Re: How to make compaction run faster?

2013-04-18 Thread Wei Zhu

We have tried very hard to speed up LCS on 1.1.6 with no luck. It seems to be 
single threaded, so there is not much parallelism you can achieve. 1.2 does 
come with parallel LCS, which should help. 
One more thing to try is to enlarge the SSTable size, which will reduce the 
number of SSTables. It *might* help LCS. 


-Wei 
- Original Message -

From: Alexis Rodríguez arodrig...@inconcertcc.com 
To: user@cassandra.apache.org 
Sent: Thursday, April 18, 2013 11:03:13 AM 
Subject: Re: How to make compaction run faster? 



Jay, 


According to iostat's man page, await is the time it takes for a request to 
the disk to get served. You may try changing the I/O scheduler. I've read that 
noop is recommended for SSDs; you can check here: http://goo.gl/XMiIA 
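
To see which scheduler a disk is using at the moment, Linux exposes it in 
/sys/block/<device>/queue/scheduler, with the active one shown in square 
brackets. A small Python sketch that reads it follows; the device name 'sda' 
and the sample string are only examples. 

```python
from pathlib import Path


def active_scheduler(contents: str) -> str:
    """Return the scheduler shown in [brackets] in a queue/scheduler file."""
    for name in contents.split():
        if name.startswith("[") and name.endswith("]"):
            return name[1:-1]
    raise ValueError("no active scheduler marked in: %r" % contents)


def scheduler_for(device: str) -> str:
    """Read the active scheduler for a block device such as 'sda'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return active_scheduler(path.read_text())


# Example file contents on a box where cfq is active.
print(active_scheduler("noop deadline [cfq]"))
```

Calling scheduler_for('sda') on each node is a quick way to confirm that all 
boxes in the ring are configured the same way. 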


Regarding compaction, a week ago we had serious problems with compaction in a 
test machine, solved by changing from openjdk 1.6 to sun-jdk 1.6. 






On Thu, Apr 18, 2013 at 2:08 PM, Jay Svc  jaytechg...@gmail.com  wrote: 




By the way, the compaction and commit log disk latency are two separate 
problems I see. 

The important one is the compaction problem. How can I speed that up? 

Thanks, 
Jay 





On Thu, Apr 18, 2013 at 12:07 PM, Jay Svc  jaytechg...@gmail.com  wrote: 



Looks like formatting is bit messed up. Please let me know if you want the same 
in clean format. 

Thanks, 
Jay 





On Thu, Apr 18, 2013 at 12:05 PM, Jay Svc  jaytechg...@gmail.com  wrote: 



Hi Aaron, Alexis, 

Thanks for reply, Please find some more details below. 


Core problem: Compaction is taking a long time to finish, so it will affect my 
reads. I have spare CPU and memory and want to utilize them to speed up the 
compaction process. 

Parameters used: 


1. SSTable size: 500MB (tried various sizes from 20MB to 1GB) 
2. Compaction throughput mb per sec: 250MB (tried from 16MB to 640MB) 
3. Concurrent write: 196 (tried from 32 to 296) 
4. Concurrent compactors: 72 (tried disabling to making it 172) 
5. Multithreaded compaction: true (tried both true and false) 
6. Compaction strategy: LCS (tried STCS as well) 
7. Memtable total space in mb: 4096 MB (tried default and some other params 
too) 


Note: I have tried almost all permutation combination of these parameters. 

Observations: 
I ran a test for 1.15 hrs with writes at the rate of 21000 records/sec (total 
60GB of data during 1.15 hrs). After I stopped the test, compaction took an 
additional 1.30 hrs to finish, which reduced the SSTable count from 170 to 17. 

CPU(24 cores): almost 80% idle during the run 
JVM: 48G RAM, 8G Heap, (3G to 5G heap used) 
Pending Writes: sometimes high spikes for small amount of time otherwise pretty 
flat 


Aaron, Please find the iostat below: the sdb and dm-2 are the commitlog disks. 

Please find the iostat of some of 3 different boxes in my cluster. 

-bash-4.1$ iostat -xkcd 
Linux 2.6.32-358.2.1.el6.x86_64 (edc-epod014-dl380-3) 04/18/2013 _x86_64_ (24 
CPU) 

avg-cpu: %user %nice %system %iowait %steal %idle 
1.20 1.11 0.59 0.01 0.00 97.09 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util 
sda 0.03 416.56 9.00 7.08 1142.49 1694.55 352.88 0.07 4.08 0.57 0.92 
sdb 0.00 172.38 0.08 3.34 10.76 702.89 416.96 0.09 24.84 0.94 0.32 
dm-0 0.00 0.00 0.03 0.75 0.62 3.00 9.24 0.00 1.45 0.33 0.03 
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.74 0.68 0.00 
dm-2 0.00 0.00 0.08 175.72 10.76 702.89 8.12 3.26 18.49 0.02 0.32 
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 7.97 0.00 0.83 0.62 0.00 
dm-4 0.00 0.00 8.99 422.89 1141.87 1691.55 13.12 4.64 10.71 0.02 0.90 


-bash-4.1$ iostat -xkcd 
Linux 2.6.32-358.2.1.el6.x86_64 (ndc-epod014-dl380-1) 04/18/2013 _x86_64_ (24 
CPU) 

avg-cpu: %user %nice %system %iowait %steal %idle 
1.20 1.12 0.52 0.01 0.00 97.14 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svc 
sda 0.01 421.17 9.22 7.43 1167.81 1714.38 346.10 0.07 3.99 0. 
sdb 0.00 172.68 0.08 3.26 10.52 703.74 427.79 0.08 25.01 0. 
dm-0 0.00 0.00 0.04 1.04 0.89 4.16 9.34 0.00 2.58 0. 
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.77 0. 
dm-2 0.00 0.00 0.08 175.93 10.52 703.74 8.12 3.13 17.78 0. 
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 7.97 0.00 1.14 0. 
dm-4 0.00 0.00 9.19 427.55 1166.91 1710.21 13.18 4.67 10.65 0. 

-bash-4.1$ iostat -xkcd 
Linux 2.6.32-358.2.1.el6.x86_64 (edc-epod014-dl380-1) 04/18/2013 _x86_64_ (24 
CPU) 

avg-cpu: %user %nice %system %iowait %steal %idle 
1.15 1.13 0.52 0.01 0.00 97.19 

Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util 
sda 0.02 429.97 9.28 7.29 1176.81 1749.00 353.12 0.07 4.10 0.55 0.91 
sdb 0.00 173.65 0.08 3.09 10.50 706.96 452.25 0.09 27.23 0.99 0.31 
dm-0 0.00 0.00 0.04 0.79 0.82 3.16 9.61 0.00 1.54 0.27 0.02 
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 0.68 0.63 0.00 
dm-2 0.00 0.00 0.08 176.74 10.50 706.96 8.12 3.46 19.53 0.02 0.31 
dm-3 0.00 0.00 0.00 0.00 0.00 0.00 7.97 0.00 0.85 0.83 0.00 
dm-4 0.00 0.00 9.26 436.46 1175.98 1745.84 13.11

Re: lots of extra bytes on disk

2013-03-28 Thread Wei Zhu

Hi Ben,
If affordable, just blow away the node and bootstrap a replacement, or 
restore from a snapshot and repair. 

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Thursday, March 28, 2013 11:40:21 AM
Subject: Re: lots of extra bytes on disk

Oh and since our LCS was 10MB per file it was easy to tell which files did
not convert yet.  Also, we ended up blowing away a CF on node 5(of 6) and
running a full repair on that CF and after he was at a normal size again
as well.

Dean

On 3/28/13 12:35 PM, Hiller, Dean dean.hil...@nrel.gov wrote:

We had a runaway STCS like this due to our own mistakes but were not sure
how to clean it up.  We went to LCS instead of STCS and that seemed to
bring it way back down since the STCS had repeats and such between
SSTables which LCS avoids mostly.  I can't help much more than that info
though.

Dean

On 3/28/13 12:31 PM, Ben Chobot be...@instructure.com wrote:

Sorry to make it confusing. I didn't have snapshots on some nodes; I just
made a snapshot on a node with this problem.

So to be clear, on this one example node
 Cassandra reports ~250GB of space used
 In a CF data directory (before snapshots existed), du -sh showed ~550GB
 After the snapshot, du in the same directory still showed ~550GB
(they're hard links, so that's correct)
 du in the snapshot directory for that CF shows ~250GB, and ls shows ~50
fewer files.
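
Because snapshot files are hard links, the files that exist in the data
directory but not in the snapshot can be found by comparing inode numbers.
Here is a sketch in Python that only lists such files with their sizes; it
deletes nothing and says nothing about whether removing them is safe. The
directory arguments are whatever your CF data directory and its snapshot
directory happen to be, and both are assumed to be on the same filesystem.

```python
import os
from typing import List, Tuple


def files_missing_from_snapshot(
    data_dir: str, snapshot_dir: str
) -> List[Tuple[str, int]]:
    """List (name, size) of regular files in data_dir with no hard link
    in snapshot_dir."""
    snapshot_inodes = {
        os.stat(entry.path).st_ino
        for entry in os.scandir(snapshot_dir)
        if entry.is_file()
    }
    missing = []
    for entry in os.scandir(data_dir):
        if not entry.is_file():
            continue  # skips sub-directories such as the snapshots folder
        info = os.stat(entry.path)
        if info.st_ino not in snapshot_inodes:
            missing.append((entry.name, info.st_size))
    return sorted(missing)


def report(data_dir: str, snapshot_dir: str) -> None:
    """Print the files outside the snapshot and their total size."""
    missing = files_missing_from_snapshot(data_dir, snapshot_dir)
    for name, size in missing:
        print(f"{size:>15,d}  {name}")
    print(f"{sum(size for _, size in missing):>15,d}  total")
```

The total it prints should be close to the gap between what du shows for the
data directory and what du shows for the snapshot.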

On Mar 28, 2013, at 11:10 AM, Hiller, Dean wrote:

 I am confused.  I thought you said you don't have a snapshot.  Df/du
 reports space used by existing data AND the snapshot.  Cassandra only
 reports on space used by actual data... if you move the snapshots, does
 df/du match what cassandra says?

 Dean

 On 3/28/13 12:05 PM, Ben Chobot be...@instructure.com wrote:

 ...though interestingly, the snapshot of these CFs have the right
 amount of data in them (i.e. it agrees with the live SSTable size
 reported by cassandra). Is it total insanity to remove the files from the
 data directory not included in the snapshot, so long as they were created
 before the snapshot?

 On Mar 28, 2013, at 10:54 AM, Hiller, Dean wrote:

 Have you cleaned up your snapshots... those take extra space and don't just
 go away unless you delete them.

 Dean

 On 3/28/13 11:46 AM, Ben Chobot be...@instructure.com wrote:

 Are you also running 1.1.5? I'm wondering (ok hoping) that this might be
 fixed if I upgrade.

 On Mar 28, 2013, at 8:53 AM, Lanny Ripple wrote:

 We occasionally (twice now on a 40 node cluster over the last 6-8
 months) see this.  My best guess is that Cassandra can fail to mark an
 SSTable for cleanup somehow.  Forced GC's or reboots don't clear them
 out.  We disable thrift and gossip; drain; snapshot; shutdown; clear
 data/Keyspace/Table/*.db and restore (hard-linking back into place to
 avoid data transfer) from the just created snapshot; restart.

 On Mar 28, 2013, at 10:12 AM, Ben Chobot be...@instructure.com
 wrote:

 Some of my cassandra nodes in my 1.1.5 cluster show a large
 discrepancy between what cassandra says the SSTables should sum up to,
 and what df and du claim exist. During repairs, this is almost always
 pretty bad, but post-repair compactions tend to bring those numbers to
 within a few percent of each other... usually. Sometimes they remain
 much further apart after compactions have finished - for instance, I'm
 looking at one node now that claims to have 205GB of SSTables, but
 actually has 450GB of files living in that CF's data directory. No
 pending compactions, and the most recent compaction for this CF
 finished just a few hours ago.

 nodetool cleanup has no effect.

 What could be causing these extra bytes, and how to get them to go
 away? I'm ok with a few extra GB of unexplained data, but an extra
 245GB (more than all the data this node is supposed to have!) is a
 little extreme.

Re: bloom filter fp ratio of 0.98 with fp_chance of 0.01

2013-03-27 Thread Wei Zhu

 sstables after changing the FP chance 


Thanks! 
Andras 
---BeginMessage---
I'm still wondering about how to choose the size of the sstable under LCS.
The default is 5MB, people used to configure it to 10MB and now you configure
it at 128MB. What are the benefits or disadvantages of a very small size
(let's say 5 MB) vs a big size (like 128MB)?

This seems to be the biggest question about LCS, and it is still
unanswered. Does anyone (committers maybe) know about it?

It would help at least us 5, but probably more people.

Alain


2013/3/8 Michael Theroux mthero...@yahoo.com

 I've asked this myself in the past... fairly arbitrarily chose 10MB based
 on Wei's experience,

 -Mike

 On Mar 8, 2013, at 1:50 PM, Hiller, Dean wrote:

  +1  (I would love to know this info).
 
  Dean
 
  From: Wei Zhu wz1...@yahoo.com
  Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
  Date: Friday, March 8, 2013 11:11 AM
  To: user@cassandra.apache.org
  Subject: Re: Size Tiered - Leveled Compaction
 
  I have been wondering the same thing.
  We started with the default 5M and the compaction after repair takes too
 long on a 200G node, so we increased the size to 10M, sort of arbitrarily
 since there is not much documentation around it. Our tech ops team still
 thinks there are too many files in one directory. To fulfill their
 guidelines (I don't remember the exact number, but something in the range of
 50K files), we will need to increase the size to around 50M. I think the
 latency of opening one file is not impacted much by the number of files in
 one directory on a modern file system. But ls and other operations
 suffer.
 
  Anyway, I asked about the side effects of a bigger SSTable in IRC.
 Someone mentioned that during a read, C* reads the whole SSTable from disk in
 order to access the row, which causes more disk IO compared with a smaller
 SSTable. I don't know enough about the internals of Cassandra, so I am not
 sure whether that's the case or not. If that is the case (with a question
 mark), is the SSTable or the row kept in memory? Hope someone can confirm the
 theory here. Otherwise I have to dig into the source code to find out.
 
  Another concern is during repair: does it stream the whole SSTable or
 only part of it when a mismatch is detected? I see claims for both, can
 someone please confirm also?
 
  The last thing is the effectiveness of the parallel LCS on 1.2. It takes
 quite some time for the compaction to finish after repair for LCS on
 1.1.X. Both CPU and disk utilization are low during the compaction, which
 means LCS doesn't fully utilize the resources.  It will make life easier if
 the issue is addressed in 1.2.
 
  Bottom line is that there is not much documentation/guideline/successful
 story around LCS although it sounds beautiful on paper.
 
  Thanks.
  -Wei
  
  From: Alain RODRIGUEZ arodr...@gmail.com
  To: user@cassandra.apache.org
  Cc: Wei Zhu wz1...@yahoo.com
  Sent: Friday, March 8, 2013 1:25 AM
  Subject: Re: Size Tiered - Leveled Compaction
 
  I'm still wondering about how to choose the size of the sstable under
 LCS. The default is 5MB, people used to configure it to 10MB and now you
 configure it at 128MB. What are the benefits or disadvantages of a very
 small size (let's say 5 MB) vs a big size (like 128MB)?
 
  Alain
 
 
  2013/3/8 Al Tobey a...@ooyala.com
  We saw exactly the same thing as Wei Zhu: 100k tables in a
 directory causing all kinds of issues.  We're running 128MiB ssTables with
 LCS and have disabled compaction throttling.  128MiB was chosen to get file
 counts under control and reduce the number of files C* has to manage and
 search. I just looked and a ~250GiB node is using about 10,000 files, which
 is quite manageable.  This configuration is running smoothly in production
 under mixed read/write load.
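
 The file counts discussed above can be sanity-checked with a little
 arithmetic. This is only a rough sketch: the function name is made up, and
 the figure of about five component files per SSTable (data, index, filter,
 statistics, compression info) is an assumption, not something stated in the
 thread.

```python
import math

def estimate_file_count(data_mib, sstable_mib, files_per_sstable=5):
    """Rough number of files on disk for a node under LCS.

    files_per_sstable is an assumed multiplier (data, index, filter,
    statistics, compression info); the real number depends on the version.
    """
    sstables = math.ceil(data_mib / sstable_mib)
    return sstables * files_per_sstable

# ~250 GiB of data with 128 MiB SSTables
print(estimate_file_count(250 * 1024, 128))
# The same node with 5 MiB SSTables
print(estimate_file_count(250 * 1024, 5))
```

 Under that assumption the 128 MiB setting lands in the same range as the
 roughly 10,000 files reported, while a 5 MiB setting would produce a file
 count well into six figures.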
 
  We're on RAID0 across 6 15k drives per machine. When we migrated data to
 this cluster we were pushing well over 26k/s+ inserts with CL_QUORUM. With
 compaction throttling enabled at any rate it just couldn't keep up. With
 throttling off, it runs smoothly and does not appear to have an impact on
 our applications, so we always leave it off, even in EC2.  An 8GiB heap is
 too small for this config on 1.1. YMMV.
 
  -Al Tobey
 
  On Thu, Feb 14, 2013 at 12:51 PM, Wei Zhu wz1...@yahoo.com wrote:
  I haven't tried to switch compaction strategy. We started with LCS.
 
  For us, after massive data imports (5000 w/seconds for 6 days), the
 first repair is painful since there is quite some data inconsistency. For
 150G nodes, repair brought in about 30 G and created thousands of pending
 compactions. It took almost

Re: nodetool repair hung?

2013-03-25 Thread Wei Zhu

Check nodetool tpstats and look for AntiEntropySessions/AntiEntropyStage.
Grep the log for "repair" and "merkle tree".

- Original Message -
From: S C as...@outlook.com
To: user@cassandra.apache.org
Sent: Monday, March 25, 2013 2:55:30 PM
Subject: nodetool repair hung?

I am using Cassandra 1.1.5. 

nodetool repair is not coming back on the command line. Did it run 
successfully? Did it hang? How do you find out whether the repair was successful? 
I did not find anything in the logs. nodetool compactionstats and nodetool 
netstats are clean. 

nodetool compactionstats 
pending tasks: 0 
Active compaction remaining time : n/a 

nodetool netstats 
Mode: NORMAL 
Not sending any streams. 
Not receiving any streams. 
Pool Name                    Active   Pending      Completed 
Commands                        n/a         0      121103621 
Responses                       n/a         0      209564496

Re: High disk I/O during reads

2013-03-22 Thread Wei Zhu

According to your cfstats, read latency is over 100 ms, which is really 
slow. I am seeing reads under 3 ms on my cluster, which is on SSD. Can you 
also check nodetool cfhistograms? It tells you more about the number of 
SSTables involved and the read/write latency. Sometimes the average doesn't tell you the 
whole story. 
Also check your nodetool tpstats: are there a lot of dropped reads?
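
To illustrate why an average can hide the real picture, here is a small 
sketch comparing mean, median, and 99th percentile. The latency numbers are 
made up for the example and the helper name is hypothetical; it is not tied 
to any nodetool output format.

```python
import statistics

def latency_summary(samples_ms):
    """Return (mean, median, p99) for a list of latencies in milliseconds."""
    ordered = sorted(samples_ms)
    # Nearest-rank 99th percentile: smallest value with >= 99% of samples at or below it.
    rank = max(1, -(-99 * len(ordered) // 100))  # ceil(0.99 * n)
    return statistics.mean(ordered), statistics.median(ordered), ordered[rank - 1]

# Made-up numbers: 95 fast reads and 5 very slow ones.
samples = [3.0] * 95 + [2000.0] * 5
mean, median, p99 = latency_summary(samples)
print(mean, median, p99)
```

With these invented samples the mean is above 100 ms even though most reads 
take 3 ms, which is the kind of skew a histogram makes visible.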

-Wei
- Original Message -
From: Jon Scarborough j...@fifth-aeon.net
To: user@cassandra.apache.org
Sent: Friday, March 22, 2013 9:42:34 AM
Subject: Re: High disk I/O during reads

Key distribution across SSTables probably varies a lot from row to row in our case. Most 
reads would probably only need to look at a few SSTables, a few might need to 
look at more. 

I don't yet have a deep understanding of C* internals, but I would imagine even 
the more expensive use cases would involve something like this: 

1) Check the index for each SSTable to determine if part of the row is there. 
2) Look at the endpoints of the slice to determine if the data in a particular 
SSTable is relevant to the query. 
3) Read the chunks of those SSTables, working backwards from the end of the 
slice until enough columns have been read to satisfy the limit clause in the 
query. 

So I would have guessed that even the more expensive queries on wide rows 
typically wouldn't need to read more than a few hundred KB from disk to do all 
that. Seems like I'm missing something major. 
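
The expectation in steps 2 and 3 can be written down as a toy model: each 
SSTable holds a fragment of the row sorted by column name, and a reversed 
slice with a limit only needs the tail of each fragment. This is purely an 
illustration of the reasoning, with made-up names and data; it is not how 
Cassandra itself implements the read path.

```python
import heapq
from itertools import islice

def reversed_slice(fragments, start, end, limit):
    """Newest-first slice of one row spread across several SSTables.

    fragments: lists of (timestamp, message), each sorted ascending.
    Returns up to `limit` columns with start <= timestamp < end.
    """
    # Walk every fragment backwards, keeping only columns inside the range.
    tails = [
        ((ts, msg) for ts, msg in reversed(frag) if start <= ts < end)
        for frag in fragments
    ]
    # Merge the descending streams lazily and stop once the limit is reached.
    merged = heapq.merge(*tails, reverse=True)
    return list(islice(merged, limit))

sstable_a = [(1, "a"), (4, "d"), (7, "g")]
sstable_b = [(2, "b"), (5, "e"), (8, "h")]
print(reversed_slice([sstable_a, sstable_b], start=2, end=8, limit=3))
```

In this model the work is bounded by the limit and the number of fragments, 
which is why far more I/O than that looks surprising.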

Here's the complete CF definition, including compression settings: 

CREATE COLUMNFAMILY conversation_text_message ( 
conversation_key bigint PRIMARY KEY 
) WITH 
comment='' AND 
comparator='CompositeType(org.apache.cassandra.db.marshal.DateType,org.apache.cassandra.db.marshal.LongType,org.apache.cassandra.db.marshal.AsciiType,org.apache.cassandra.db.marshal.AsciiType)' AND 
read_repair_chance=0.10 AND 
gc_grace_seconds=864000 AND 
default_validation=text AND 
min_compaction_threshold=4 AND 
max_compaction_threshold=32 AND 
replicate_on_write=True AND 
compaction_strategy_class='SizeTieredCompactionStrategy' AND 
compression_parameters:sstable_compression='org.apache.cassandra.io.compress.SnappyCompressor';
 

Much thanks for any additional ideas. 

-Jon 



On Fri, Mar 22, 2013 at 8:15 AM, Hiller, Dean  dean.hil...@nrel.gov  wrote: 


Did you mean to ask are 'all' your keys spread across all SSTables? I am 
guessing at your intention. 

I mean I would very well hope my keys are spread across all sstables; 
otherwise that sstable should not be there, as it has no keys in it ;). 

And I know we had HUGE disk usage from the duplication in our sstables on 
size-tiered compaction. We never ran a major compaction, but after we switched 
to LCS we went from 300G to some 120G or something like that, which was nice. 
We only have 300 data point posts / second, so not an extreme write load on 6 
nodes, though these posts also cause reads to check authorization and such in 
our system. 

Dean 

From: Kanwar Sangha kan...@mavenir.com 
Reply-To: user@cassandra.apache.org 
Date: Friday, March 22, 2013 8:38 AM 
To: user@cassandra.apache.org 
Subject: RE: High disk I/O during reads 


Are your keys spread across all SSTables? That would cause every SSTable to be 
read, which will increase the I/O. 

What compaction strategy are you using? 

From: zod...@fifth-aeon.net [mailto:zod...@fifth-aeon.net] On Behalf Of Jon Scarborough 
Sent: 21 March 2013 23:00 
To: user@cassandra.apache.org 
Subject: High disk I/O during reads 

Hello, 

We've had a 5-node C* cluster (version 1.1.0) running for several months. Up 
until now we've mostly been writing data, but now we're starting to service 
more read traffic. We're seeing far more disk I/O to service these reads than I 
would have anticipated. 

The CF being queried consists of chat messages. Each row represents a 
conversation between two people. Each column represents a message. The column 
key is composite, consisting of the message date and a few other bits of 
information. The CF is using compression. 

The query is looking for a maximum of 50 messages between two dates, in reverse 
order. Usually the two dates used as endpoints are 30 days ago and the current 
time. The query in Astyanax looks like this: 

ColumnList<ConversationTextMessageKey> result = 
    keyspace.prepareQuery(CF_CONVERSATION_TEXT_MESSAGE) 
        .setConsistencyLevel(ConsistencyLevel.CL_QUORUM) 
        .getKey(conversationKey) 
        .withColumnRange( 
            textMessageSerializer.makeEndpoint(endDate, Equality.LESS_THAN).toBytes(), 
            textMessageSerializer.makeEndpoint(startDate, 
                Equality.GREATER_THAN_EQUALS).toBytes(), 
            true, 
            maxMessages) 
        .execute() 
        .getResult(); 

We're currently servicing around 30 of

Re: cannot start Cassandra on Windows7

2013-03-22 Thread Wei Zhu

It's there:

http://www.datastax.com/docs/1.2/cluster_architecture/cluster_planning#node-init-config

It's a long document. You need to look at cassandra.yaml and 
cassandra-env.sh and make sure you understand the settings there. 

By the way, did DataStax just give their documentation web site a facelift? It looks nice.
-Wei

- Original Message -
From: Marina ppi...@yahoo.com
To: user@cassandra.apache.org
Sent: Friday, March 22, 2013 9:18:46 AM
Subject: Re: cannot start Cassandra on Windows7

Jabbar Azam ajazam at gmail.com writes:

 
 
 
 
 
 
 Oops, I also had opscenter installed on my PC.
 My changes:

 log4j-server.properties file:
 log4j.appender.R.File=c:/var/log/cassandra/system.log

 cassandra.yaml file:
 # directories where Cassandra should store data on disk.
 data_file_directories:
     - c:/var/lib/cassandra/data
 # commit log
 commitlog_directory: c:/var/lib/cassandra/commitlog
 # saved caches
 saved_caches_directory: c:/var/lib/cassandra/saved_caches

 I also added an environment variable for windows called CASSANDRA_HOME.
 I needed to do this for one of my colleagues and now it's documented ;)
 
 
 
 

Thanks, Jabbar, Victor,
Yes, after I made similar changes I was able to start Cassandra too. 
Would be nice if these instructions were included with the main Cassandra 
documentation/WIKI :)

Thanks!
Marina


 On 22 March 2013 15:47, Jabbar Azam ajazam at gmail.com wrote:
 Viktor, you're right. I didn't get any errors on my windows console but 
cassandra.yaml and log4j-server.properties need modifying.
 
 
 
 On 22 March 2013 15:44, Viktor Jevdokimov Viktor.Jevdokimov at adform.com 
wrote:
 You NEED to edit the paths in cassandra.yaml and log4j-server.properties before 
starting on Windows.
 There are a LOT of things to learn for starters. Google for Cassandra on 
Windows.
 Best regards / Pagarbiai
 Viktor Jevdokimov
 Senior Developer
 Email: Viktor.Jevdokimov at adform.com
 Phone: +370 5 212 3063
 Fax: +370 5 261 0453
 J. Jasinskio 16C,
 LT-01112 Vilnius,
 Lithuania
 Disclaimer: The information contained in this message and attachments is 
intended solely for the attention and use of the named addressee and may be 
confidential. If you are not the intended recipient, you are reminded that the 
information remains the property of the sender. You must not use, disclose, 
distribute, copy, print or rely on this e-mail. If you have received this 
message in error, please contact the sender immediately and irrevocably delete 
this message and any copies.
 -Original Message-
 
 
  From: Marina [mailto:ppine7 at yahoo.com]
  Sent: Friday, March 22, 2013 17:21
  To: user at cassandra.apache.org
  Subject: cannot start Cassandra on Windows7
  Hi,
  I have downloaded apache-cassandra-1.2.3-bin.tar.gz and unzipped it on my
  Windows7 machine (I did not find a Windows-specific distributable...). 
  Then I tried to start Cassandra as follows and got an error:
 
  C:\Marina\Tools\apache-cassandra-1.2.3\bin>cassandra.bat -f
  Starting Cassandra Server
  Exception in thread "main" java.lang.ExceptionInInitializerError
  Caused by: java.lang.RuntimeException: Couldn't figure out log4j configuration: log4j-server.properties
          at org.apache.cassandra.service.CassandraDaemon.initLog4j(CassandraDaemon.java:81)
          at org.apache.cassandra.service.CassandraDaemon.<clinit>(CassandraDaemon.java:57)
  Could not find the main class: org.apache.cassandra.service.CassandraDaemon.  Program will exit.
 
  C:\Marina\Tools\apache-cassandra-1.2.3\bin>
 
  It looks similar to the Cassandra issue that was already fixed:
  https://issues.apache.org/jira/browse/CASSANDRA-2383
 
  however I am still getting this error
  I am an Administrator on my machine, and have access to all files in the
  apache-cassandra-1.2.3\conf dir, including the log4j ones.
 
  Do I need to configure anything else on Windows? I did not find any
  Windows-specific installation/setup/startup instructions - if there are 
  such documents somewhere, please let me know!
 
  Thanks,
  Marina
 
 
 
 
 
 
 
 -- Thanks Jabbar Azam
 
 
 
 
 
 -- Thanks Jabbar Azam

Re: Stream fails during repair, two nodes out-of-memory

2013-03-22 Thread Wei Zhu

Compaction needs some disk I/O, so slowing down your compaction will improve 
overall system performance. Of course, you don't want to go too slow and fall 
too far behind.
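
The trade-off can be put into rough numbers. The sketch below assumes the 
throttle is the only limit on compaction; the 30 GB backlog is purely 
illustrative and the function name is made up.

```python
def compaction_hours(pending_gb, throughput_mb_per_sec):
    """Hours needed to rewrite pending_gb at a throttled compaction rate."""
    seconds = pending_gb * 1024 / throughput_mb_per_sec
    return seconds / 3600

# 30 GB of data waiting to be compacted, at two throttle settings
for rate in (8, 16):
    print(rate, "MB/s ->", round(compaction_hours(30, rate), 2), "hours")
```

Halving the throttle frees disk bandwidth for reads and writes but doubles 
the time the backlog takes to clear, which is the "fall behind" risk.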

-Wei

- Original Message -
From: Dane Miller d...@optimalsocial.com
To: user@cassandra.apache.org
Cc: Wei Zhu wz1...@yahoo.com
Sent: Friday, March 22, 2013 4:12:56 PM
Subject: Re: Stream fails during repair, two nodes out-of-memory

On Thu, Mar 21, 2013 at 10:28 AM, aaron morton aa...@thelastpickle.com wrote:
 heap of 1867M is kind of small. According to the discussion on this list,
 it's advisable to have m1.xlarge.

 +1

 In cassandra-env.sh set MAX_HEAP_SIZE to 4GB and HEAP_NEWSIZE to
 400M

 In the yaml file set

 in_memory_compaction_limit_in_mb to 32
 compaction_throughput_mb_per_sec to 8
 concurrent_compactors to 2

 This will slow down compaction a lot. You may want to restore some of these
 settings once you have things stable.

 You have an under powered box for what you are trying to do.

Thanks very much for the info.  Have made the changes and am retrying.
 I'd like to understand, why does it help to slow compaction?

It does seem like the cluster is under powered to handle our
application's full write load plus repairs, but it operates fine
otherwise.

On Wed, Mar 20, 2013 at 8:47 PM, Wei Zhu wz1...@yahoo.com wrote:
 It's clear you are out of memory. How big is your data size?

120 GB per node, of which 50% is actively written/updated, and 50% is
read-mostly.

Dane

Re: Stream fails during repair, two nodes out-of-memory

2013-03-20 Thread Wei Zhu

;) ).

One obvious reason is that administering a 24 node cluster does add person-time
overhead.

Another reason is the reduced impact of maintenance activities such as repair,
as these activities have significant CPU overhead. Doubling the cluster size
would, in theory, halve the time for this overhead, but would still impact
performance during that time. Going to xlarge would lessen the impact of
these activities on operations.

Anything else?

Thanks,

-Mike

On Mar 14, 2013, at 9:27 AM, aaron morton wrote:

Because of this I have an unstable cluster and have no other choice than
use Amazon EC2 xLarge instances when we would rather use twice more EC2
Large nodes.
m1.xlarge is a MUCH better choice than m1.large.
You get more ram and better IO and less steal. Using half as many m1.xlarge
is the way to go.

My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to
the max 8 GB (crashing the node).
How is it crashing ?
Are you getting too much GC or running OOM ?
Are you using the default GC configuration ?
Is cassandra logging a lot of GC warnings ?

If you are running OOM then something has to change. Maybe bloom filters,
maybe caches.

Enable the GC logging in cassandra-env.sh to check how low a CMS collection
gets the heap, or use some other tool. That will give an idea of how much
memory you are using.

Here is some background on what is kept on heap in pre 1.2
http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 13/03/2013, at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote:

Here is the JIRA I submitted regarding the ancestor.

https://issues.apache.org/jira/browse/CASSANDRA-5342

-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 11:35:29 AM
Subject: Re: About the heap

Hi Dean,
The index_interval is controlling the sampling of the SSTable to speed up
the lookup of the keys in the SSTable. Here is the code:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478

Increasing the interval means taking fewer samples: less memory, but slower
lookups on reads.

I did do a heap dump on my production system which caused about 10 seconds
pause of the node. I found something interesting, for LCS, it could involve
thousands of SSTables for one compaction, the ancestors are recorded in
case something goes wrong during the compaction. But those are never
removed after the compaction is done. In our case, it takes about 1G of
heap memory to store that. I am going to submit a JIRA for that.

Here is the culprit:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58

Enjoy looking at Cassandra code:)

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 11:11:14 AM
Subject: Re: About the heap

Going to 1.2.2 helped us quite a bit as well as turning on LCS from STCS
which gave us smaller bloomfilters.

As far as key cache. There is an entry in cassandra.yaml called
index_interval set to 128. I am not sure if that is related to key_cache.
I think it is. By turning that to 512 or maybe even 1024, you will consume
less ram there as well though I ran this test in QA and my key cache size
stayed the same so I am really not sure(I am actually checking out
cassandra code now to dig a little deeper into this property.

Dean

From: Alain RODRIGUEZ arodr...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, March 13, 2013 10:11 AM
To: user@cassandra.apache.org
Subject: About the heap

Hi,

I would like to know everything that is in the heap.

We are here speaking of C*1.1.6

Theory :

- Memtable (1024 MB)
- Key Cache (100 MB)
- Row Cache (disabled, and serialized with JNA activated anyway, so should
be off-heap)
- BloomFilters (about 1,03 GB - from cfstats, adding all the Bloom Filter
Space Used and considering they are showed in Bytes - 1103765112)
- Anything else ?

So my heap should be fluctuating between 1,15 GB and 2.15 GB and growing
slowly (from the new BF of my new data).

My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to
the max 8 GB (crashing the node).

Because of this I have an unstable cluster and have no other choice than
use Amazon EC2 xLarge instances when we would rather use twice more EC2
Large nodes.

What am I missing ?

Practice :

Is there a way not inducing any load and easy

Re: How to configure linux service for Cassandra

2013-03-19 Thread Wei Zhu

Are you looking for something like this?

http://www.centos.org/docs/5/html/Deployment_Guide-en-US/s1-services-chkconfig.html

Thanks.
-Wei

- Original Message -
From: Jason Kushmaul | WDA jason.kushm...@wda.com
To: user@cassandra.apache.org user@cassandra.apache.org
Sent: Tuesday, March 19, 2013 5:58:37 AM
Subject: RE: How to configure linux service for Cassandra

I'm not sure about the CentOS version, but you could make use of the hard work 
that DataStax has done with their community edition RPMs: an init script is 
installed for you.

 


-Original Message-
From: Roshan [mailto:codeva...@gmail.com] 
Sent: Tuesday, March 19, 2013 7:10 AM
To: cassandra-u...@incubator.apache.org
Subject: How to configure linux service for Cassandra

Hi 

I want to start Cassandra as a service. At the moment it is started as a 
background task.

Cassandra version: 1.0.11
OS: CentOS 5.X

Any help is much appreciated. 

Thanks.



--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-to-configure-linux-service-for-Cassandra-tp7586474.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.

Re: Recovering from a faulty cassandra node

2013-03-19 Thread Wei Zhu

Hi Dean,
If you are not using vnodes and are trying to replace the node, use the old 
token - 1 as the new token, not + 1. The reason is that tokens are assigned clockwise 
along the ring. If you set your new token to old token - 1, the new node will 
take over all the data of the old node except for the one token that was assigned 
to the old node. If you set the new token to old token + 1, the new node 
will only stream the data of one token. So, as a good practice, don't use 0 as a 
node token; start with 100. It's easier to go down from 100 than to go down 
from 0 (where you need to calculate 2 ^ 127 - 1).
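
The wrap-around is easy to get wrong by hand, so here is a minimal sketch of 
the arithmetic. It assumes RandomPartitioner, whose tokens run from 0 to 
2^127 - 1; the function name is made up for the example.

```python
RING_SIZE = 2 ** 127  # RandomPartitioner tokens run from 0 to 2**127 - 1

def replacement_token(old_token):
    """Token just before old_token on the ring, wrapping around below 0."""
    return (old_token - 1) % RING_SIZE

print(replacement_token(100))  # one step down from 100
print(replacement_token(0))    # wraps around to the top of the ring
```

Python integers are arbitrary precision, so the 2^127 case needs no special 
handling.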

Hope I didn't confuse you.

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Tuesday, March 19, 2013 8:25:25 AM
Subject: Re: Recovering from a faulty cassandra node

I have not done this as of yet, but from all that I have read your best option 
is to follow the replace node documentation, for which I believe you need to:


 1.  Have the token be the same BUT add 1 to it so it doesn't think it's the 
same computer
 2.  Have the bootstrap option set or something so streaming takes effect.

I would however test that all out in QA to make sure it works, and if you have 
QUORUM reads/writes, a good part of that test would be to take node X down after 
your node Y is back in the cluster to make sure reads/writes are working on the 
node you fixed. You just need to make sure node X shares one of the token 
ranges of node Y AND your writes/reads are in that token range.

Dean

From: Jabbar Azam aja...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, March 19, 2013 8:51 AM
To: user@cassandra.apache.org
Subject: Recovering from a faulty cassandra node

Hello,

I am using Cassandra 1.2.2 on a 4 node test cluster with vnodes. I spent 
over a week inserting lots of data into the cluster. Towards the end of the 
process one of the nodes had a hardware fault.

I have fixed the hardware fault but the file system on that node is corrupt, 
so I'll have to reinstall the OS and Cassandra.

I can think of two ways of reintegrating the host into the cluster

1) shrink the cluster to three nodes and add the node into the cluster

2) Add the node into the cluster without shrinking

I'm not sure of the best approach to take and I'm not sure how to achieve each 
step.

Can anybody help?


--
Thanks

 Jabbar Azam

Re: Truncate behaviour

2013-03-19 Thread Wei Zhu

There is a setting in the cassandra.yaml file which controls that.


# Whether or not a snapshot is taken of the data before keyspace truncation
# or dropping of column families. The STRONGLY advised default of true 
# should be used to provide data safety. If you set this flag to false, you will
# lose data on truncation or drop.
auto_snapshot: true
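
With auto_snapshot enabled, truncated data stays on disk inside a snapshot 
until the snapshot is cleared (for example with nodetool clearsnapshot). A 
small sketch for measuring how much space snapshots hold is below. It assumes 
snapshots live in directories named "snapshots" under the data directory; the 
function name is made up.

```python
import os

def snapshot_bytes(data_dir):
    """Total size of all files that sit under a 'snapshots' directory."""
    total = 0
    for root, _dirs, files in os.walk(data_dir):
        # Count a file only if one of its parent directories is named 'snapshots'.
        if "snapshots" in os.path.relpath(root, data_dir).split(os.sep):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
    return total
```

Calling it on the configured data directory (commonly 
/var/lib/cassandra/data) before and after a truncate should show whether the 
space simply moved into a snapshot.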


- Original Message -
From: Víctor Hugo Oliveira Molinar vhmoli...@gmail.com
To: user@cassandra.apache.org
Sent: Tuesday, March 19, 2013 11:50:35 AM
Subject: Truncate behaviour

Hello guys! 
I'm researching the behaviour of truncate operations in Cassandra. 


Reading the official wiki page ( http://wiki.apache.org/cassandra/API ) we can 
understand it as: 

Removes all the rows from the given column family. 


And reading the DataStax page( 
http://www.datastax.com/docs/1.0/references/cql/TRUNCATE ) we can understand it 
as: 
 A TRUNCATE statement results in the immediate, irreversible removal of all 
data in the named column family. 


But I think there is an important point missing about truncate operations. 
At least in version 1.2.0, whenever I run a truncate operation, C* 
automatically creates a snapshot of the column family, resulting in fake 
free disk space. 

I'm intentionally saying 'fake free disk space' because I only figured it 
out when the machine's disk usage was high. 




- Is creating a snapshot of each CF before a truncate operation a safety 
behaviour of C*? 
- In my scenario I need to purge my column family data every day. 
Based on the docs, I thought that truncate could handle it, but it doesn't. 
And since I don't want to delete those snapshots manually, I'd like to know if 
there is a safe and practical way to perform a daily purge of this CF data. 



Thanks in advance!

Re: 13k pending compaction tasks but ZERO running?

2013-03-14 Thread Wei Zhu

Did you restart the node? As far as I can tell, compactions start a few minutes after 
restarting. Did you see a file called $CFName.json ($CFName is your cf name) in 
your data directory? 

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Thursday, March 14, 2013 8:36:00 AM
Subject: Re: 13k pending compaction tasks but ZERO running?

Duh me.  I forgot to mention I ran nodetool compact k cf and it was
done in 45 seconds and I still had a 36G file (darn, I am usually better
about putting the detail in my emails... I thought I added that).  I also
went into JMX and then did it there.

I ran it again and it took 15 seconds.  My script is as follows so that I
know the times... it's too bad nodetool doesn't log times for all tools they
have.  If I have time, I should just do a pull request on that.

#!/bin/bash

date >> compacttimes.txt
nodetool compact databus5 nreldata >> compacttimes.txt
date >> compacttimes.txt

Thanks,
Dean

On 3/14/13 9:26 AM, Michael Theroux mthero...@yahoo.com wrote:

Hi Dean,

I saw the same behavior when we switched from STCS to LCS on a couple of
our tables.  Not sure why it doesn't proceed immediately (I pinged the
list, but didn't get any feedback).  However, running nodetool compact
keyspace table got things moving for me.

-Mike

On Mar 14, 2013, at 10:44 AM, Hiller, Dean wrote:

 How do I get my node to run through the 13k pending compaction tasks?
I had to use iptables to take the node out of the cluster for now and it
is my only node still on STCS.  In cassandra-cli, it shows LCS but on
disk, I see a 36Gig file (i.e. it must still be STCS).  How can I get the 13k
pending tasks to start running?

 Nodetool compactionstats ...
 pending tasks: 13793
 Active compaction remaining time :n/a

 Thanks,
 Dean

Re: 13k pending compaction tasks but ZERO running?

2013-03-14 Thread Wei Zhu

No problem. Back to the old trick: if it doesn't work, restart :) 

 From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org user@cassandra.apache.org; Wei Zhu 
wz1...@yahoo.com 
Sent: Thursday, March 14, 2013 9:53 AM
Subject: Re: 13k pending compaction tasks but ZERO running?

Ah, you are a lifesaver.  I was so used to keeping the nodes always up.
That worked.  It is finally taking effect.
Thanks,
Dean

On 3/14/13 10:47 AM, Wei Zhu wz1...@yahoo.com wrote:

Did you restart the node? As far as I can tell, compactions start a few minutes
after restarting. Did you see a file called $CFName.json ($CFName is your
cf name) in your data directory?

-Wei


Re: About the heap

2013-03-13 Thread Wei Zhu

Hi Dean,
The index_interval controls the sampling of the SSTable index, which speeds up the 
lookup of keys in the SSTable. Here is the code:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478

Increasing the interval means taking fewer samples: less memory, but slower 
lookups on reads.
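
As a rough illustration of that trade-off, the number of sampled index 
entries shrinks in proportion to the interval. The key count below is 
hypothetical and the function name is made up; per-entry memory cost is not 
modelled.

```python
import math

def sampled_keys(total_keys, index_interval=128):
    """Number of index entries held in memory when every Nth key is sampled."""
    return math.ceil(total_keys / index_interval)

# Hypothetical node with 500 million keys
for interval in (128, 512, 1024):
    print(interval, sampled_keys(500_000_000, interval))
```

Going from 128 to 512 cuts the in-memory sample to a quarter, at the cost of 
scanning more index entries on disk per lookup.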

I did a heap dump on my production system, which caused a pause of about 10 
seconds on the node. I found something interesting: with LCS, one compaction 
can involve thousands of SSTables, and the ancestors are recorded in case 
something goes wrong during the compaction. But those are never removed after 
the compaction is done. In our case, it takes about 1G of heap memory to store 
that. I am going to submit a JIRA for that. 

Here is the culprit:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58

Enjoy looking at Cassandra code:)

-Wei
 

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 11:11:14 AM
Subject: Re: About the heap

Going to 1.2.2 helped us quite a bit, as did switching from STCS to LCS, which 
gave us smaller bloomfilters.

As far as key cache goes, there is an entry in cassandra.yaml called index_interval, 
set to 128.  I am not sure if that is related to key_cache.  I think it is.  By 
turning that up to 512 or maybe even 1024, you will consume less RAM there as well, 
though I ran this test in QA and my key cache size stayed the same, so I am 
really not sure (I am actually checking out the cassandra code now to dig a little 
deeper into this property).

Dean

From: Alain RODRIGUEZ arodr...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, March 13, 2013 10:11 AM
To: user@cassandra.apache.org
Subject: About the heap

Hi,

I would like to know everything that is in the heap.

We are here speaking of C*1.1.6

Theory :

- Memtable (1024 MB)
- Key Cache (100 MB)
- Row Cache (disabled, and serialized with JNA activated anyway, so should be 
off-heap)
- BloomFilters (about 1.03 GB - from cfstats, adding up all the Bloom Filter 
Space Used values and assuming they are shown in bytes - 1103765112)
- Anything else?

So my heap should be fluctuating between 1.15 GB and 2.15 GB and growing slowly 
(from the new BF of my new data).

My heap is actually changing from 3-4 GB to 6 GB and sometimes growing to the 
max 8 GB (crashing the node).
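
The expected range can be rechecked from the listed components. The sketch 
below only adds up the figures given in the message, treating the memtable 
space as anywhere between empty and full; it says nothing about what else 
occupies the heap.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

memtable = 1024 * MIB          # memtable space from the message
key_cache = 100 * MIB          # key cache size from the message
bloom_filters = 1103765112     # bytes, summed from cfstats in the message

low = (key_cache + bloom_filters) / GIB              # memtables empty
high = (memtable + key_cache + bloom_filters) / GIB  # memtables full

print(f"expected heap: {low:.2f} to {high:.2f} GiB")
```

That comes out at roughly 1.13 to 2.13 GiB, close to the estimate above and 
far below the 3 to 8 GB actually observed.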

Because of this I have an unstable cluster and have no other choice than to use 
Amazon EC2 xLarge instances when we would rather use twice as many EC2 Large nodes.

What am I missing ?

Practice :

Is there an easy way to dump the heap without inducing any load, to analyse 
it with MAT (or anything else that you could advise)?

Alain

Re: About the heap

2013-03-13 Thread Wei Zhu

It's not BloomFilter. 

Cassandra will read through sstable index files on start-up, doing what is 
known as index sampling. This is used to keep a subset (currently and by 
default, 1 out of 100) of keys and their on-disk location in the index, in 
memory. See ArchitectureInternals. This means that the larger the index files 
are, the longer it takes to perform this sampling. Thus, for very large indexes 
(typically when you have a very large number of keys) the index sampling on 
start-up may be a significant issue.

http://wiki.apache.org/cassandra/LargeDataSetConsiderations

-Wei

- Original Message -
From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 11:28:28 AM
Subject: Re: About the heap


 called index_interval set to 128 


I think this is for BloomFilters actually. 



2013/3/13 Hiller, Dean  dean.hil...@nrel.gov  


Going to 1.2.2 helped us quite a bit, as did switching from STCS to LCS, which
gave us smaller bloomfilters.

As for the key cache: there is an entry in cassandra.yaml called
index_interval, set to 128. I am not sure whether that is related to
key_cache; I think it is. By turning it up to 512 or maybe even 1024, you
should consume less RAM there as well, though when I ran this test in QA my key
cache size stayed the same, so I am really not sure. (I am actually checking
out the Cassandra code now to dig a little deeper into this property.)

Dean 


Re: repair hangs

2013-03-13 Thread Wei Zhu

Do you see anything related to merkle tree in your log?

Also run nodetool compactionstats; during merkle tree calculation you will
see a validation task there.

-Wei
- Original Message -
From: Dane Miller d...@optimalsocial.com
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 10:54:50 AM
Subject: repair hangs

Hi,

On one of my nodes, nodetool repair -pr has been running for 48 hours
and appears to be hung, with no output and no AntiEntropy messages in
system.log for 40+ hours.  Load, cpu, etc are all near zero.  There
are no other repair jobs running in my cluster.

What's the recommended way to deal with a hung repair job?  Is it the
symptom of a larger problem?  More info follows...

On the node where the repair is running/hung, nodetool tpstats shows
1 Active and 1 Pending AntiEntropySessions.

nodetool netstats reports "Not sending any streams." and "Not receiving any
streams."

I created this cluster by copying and restoring snapshots from another
cluster.  The new cluster has the same number of nodes and same tokens
as the original.  However, the rack assignment is different: the new
cluster uses a single rack, the original cluster uses multiple racks.
The replication strategy is SimpleStrategy for all keyspaces.

Details:
6 node cluster
cassandra  1.2.2
RandomPartitioner, EC2Snitch
Ubuntu 12.04 x86_64
EC2 m1.large

Thanks,
Dane

Re: About the heap

2013-03-13 Thread Wei Zhu

Here is the JIRA I submitted regarding the ancestor.

https://issues.apache.org/jira/browse/CASSANDRA-5342

-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 11:35:29 AM
Subject: Re: About the heap

Hi Dean,
The index_interval setting controls how the SSTable index is sampled to speed
up key lookups in the SSTable. Here is the code:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/db/DataTracker.java#L478

Increasing the interval means taking fewer samples: less memory, but slower
lookups on reads.
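
To make the trade-off concrete, here is a minimal sketch of how the number of in-memory samples shrinks as the interval grows. It assumes one sample is kept per index_interval keys; the key count is a made-up figure, and the per-sample memory cost is deliberately left out since it depends on key size and version.

```python
import math


def index_samples(key_count: int, index_interval: int) -> int:
    """Approximate number of index samples held in memory.

    Assumes one sample is kept for every `index_interval` keys.
    """
    if index_interval < 1:
        raise ValueError("index_interval must be >= 1")
    return math.ceil(key_count / index_interval)


keys = 1_000_000  # hypothetical key count, for illustration only
for interval in (128, 512, 1024):
    print(f"index_interval={interval}: ~{index_samples(keys, interval)} samples")
```

Going from 128 to 1024 cuts the sample count by a factor of eight, at the price of scanning more index entries per lookup.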

I did a heap dump on my production system, which paused the node for about 10
seconds. I found something interesting: with LCS, one compaction can involve
thousands of SSTables, and the ancestors are recorded in case something goes
wrong during the compaction. But those are never removed after the compaction
is done. In our case, it takes about 1G of heap memory to store them. I am
going to submit a JIRA for that.

Here is the culprit:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/SSTableMetadata.java#L58

Enjoy looking at Cassandra code:)

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Wednesday, March 13, 2013 11:11:14 AM
Subject: Re: About the heap

Going to 1.2.2 helped us quite a bit, as did switching from STCS to LCS, which
gave us smaller bloomfilters.

As for the key cache: there is an entry in cassandra.yaml called
index_interval, set to 128. I am not sure whether that is related to
key_cache; I think it is. By turning it up to 512 or maybe even 1024, you
should consume less RAM there as well, though when I ran this test in QA my key
cache size stayed the same, so I am really not sure. (I am actually checking
out the Cassandra code now to dig a little deeper into this property.)

Dean


Re: repair hangs

2013-03-13 Thread Wei Zhu

My guess would be that some exception occurred during the repair and your
session was aborted. Here is the code that does the repair:

https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/AntiEntropyService.java

Look for the logger.info calls there and compare them with your log file. That
should give you a rough idea of the stage at which the repair died.

-Wei

- Original Message -
From: Dane Miller d...@optimalsocial.com
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Wednesday, March 13, 2013 12:32:20 PM
Subject: Re: repair hangs

On Wed, Mar 13, 2013 at 11:44 AM, Wei Zhu wz1...@yahoo.com wrote:
Do you see anything related to merkle tree in your log?

Also do a nodetool compactionstats, during merkle tree calculation, you will 
see validation there.

The last mention of merkle is 2 days old.  compactionstats are:

$ nodetool compactionstats
pending tasks: 7
Active compaction remaining time :n/a

Does this help explain anything?

Dane

Re: Size Tiered - Leveled Compaction

2013-03-08 Thread Wei Zhu

I have been wondering the same thing.
We started with the default 5M, but compaction after repair took too long on a
200G node, so we increased the size to 10M, somewhat arbitrarily, since there
is not much documentation around it. Our tech ops team still thinks there are
too many files in one directory. To meet their guideline (I don't remember the
exact number, but something in the range of 50K files), we would need to
increase the size to around 50M. I think the latency of opening one file is not
affected much by the number of files in a directory on a modern file system,
but ls and other operations suffer.

Anyway, I asked on IRC about the side effects of bigger SSTables. Someone
mentioned that during a read, C* reads the whole SSTable from disk in order to
access the row, which causes more disk IO than with smaller SSTables. I don't
know enough about Cassandra internals to say whether that is the case. If it
is (with a question mark), is it the SSTable or the row that is kept in
memory? I hope someone can confirm the theory here; otherwise I will have to
dig into the source code to find out.

Another concern is repair: does it stream the whole SSTable or only part of it
when a mismatch is detected? I have seen both claimed; can someone please
confirm that as well?

The last thing is the effectiveness of parallel LCS in 1.2. On 1.1.X it takes
quite some time for compaction to finish after a repair with LCS. Both CPU and
disk utilization are low during the compaction, which means LCS doesn't fully
use the available resources. It would make life easier if this is addressed in
1.2.

The bottom line is that there is not much documentation, guidance, or success
stories around LCS, although it sounds beautiful on paper.

Thanks.
-Wei


 From: Alain RODRIGUEZ arodr...@gmail.com
To: user@cassandra.apache.org 
Cc: Wei Zhu wz1...@yahoo.com 
Sent: Friday, March 8, 2013 1:25 AM
Subject: Re: Size Tiered - Leveled Compaction
 

I'm still wondering how to choose the sstable size under LCS. The default is
5MB, people used to configure it to 10MB, and now you configure it at 128MB.
What are the benefits or drawbacks of a very small size (let's say 5 MB) vs a
big size (like 128MB)?

Alain



2013/3/8 Al Tobey a...@ooyala.com

We saw exactly the same thing as Wei Zhu: 100k tables in a directory causing
all kinds of issues. We're running 128MiB ssTables with LCS and have disabled
compaction throttling. 128MiB was chosen to get file counts under control and
reduce the number of files C* has to manage & search. I just looked, and a
~250GiB node is using about 10,000 files, which is quite manageable. This
configuration is running smoothly in production under mixed read/write load.


We're on RAID0 across 6 15k drives per machine. When we migrated data to this 
cluster we were pushing well over 26k/s+ inserts with CL_QUORUM. With 
compaction throttling enabled at any rate it just couldn't keep up. With 
throttling off, it runs smoothly and does not appear to have an impact on our 
applications, so we always leave it off, even in EC2.  An 8GiB heap is too 
small for this config on 1.1. YMMV.

-Al Tobey


On Thu, Feb 14, 2013 at 12:51 PM, Wei Zhu wz1...@yahoo.com wrote:

I haven't tried to switch compaction strategy. We started with LCS. 


For us, after massive data imports (5000 writes/second for 6 days), the first
repair is painful since there is quite a lot of data inconsistency. For 150G
nodes, repair brought in about 30G and created thousands of pending
compactions. It took almost a day to clear those. Just be prepared: LCS is
really slow in 1.1.X. System performance degrades during that time since
reads can hit more SSTables; we see 20 SSTable lookups for one read. (We
tried everything we could and couldn't speed it up. I think it's single
threaded, and it's not recommended to turn on multithreaded compaction. We
even tried that; it didn't help.) There is parallel LCS in 1.2, which is
supposed to alleviate the pain. We haven't upgraded yet; hope it works :)


http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2





Since our cluster is not write intensive (only 100 writes/second), I don't see
any pending compactions during regular operation.


One thing worth mentioning is the SSTable size. The default is 5M, which is
kind of small for a 200G data set (all in one CF), and we are on SSD. That is
more than 150K files in one directory (200G/5M = 40K SSTables, and each SSTable
creates 4 files on disk). You might want to watch that when deciding on the
SSTable size.
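
The file-count estimate above is easy to redo for other SSTable sizes. A small sketch, assuming 4 on-disk files per SSTable as stated in this message (other versions may write a different number of components) and binary units for the sizes:

```python
import math


def lcs_file_count(data_mb: int, sstable_size_mb: int, files_per_sstable: int = 4):
    """Return (sstable_count, file_count) for a data set under LCS.

    files_per_sstable=4 is the figure quoted in this thread, not a constant.
    """
    if sstable_size_mb < 1:
        raise ValueError("sstable_size_mb must be >= 1")
    sstables = math.ceil(data_mb / sstable_size_mb)
    return sstables, sstables * files_per_sstable


data_mb = 200 * 1024  # the 200G node discussed above
for size_mb in (5, 10, 50):
    sstables, files = lcs_file_count(data_mb, size_mb)
    print(f"{size_mb}M SSTables: ~{sstables} SSTables, ~{files} files")
```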


By the way, there is no concept of Major compaction for LCS. Just for fun, 
you can look at a file called $CFName.json in your data directory and it 
tells you the SSTable distribution among different levels. 


-Wei




 From: Charles Brophy cbro...@zulily.com
To: user@cassandra.apache.org 
Sent: Thursday, February 14, 2013 8:29 AM
Subject: Re

Re: should I file a bug report on this or is this normal?

2013-03-07 Thread Wei Zhu

It seems to be normal for data size to explode during repair. In our case, we
have a node of around 200G with RF=3; during repair it goes as high as 300G.
We are using LCS; repair creates more than 5000 compaction tasks and they take
more than a day to finish. We are on 1.1.6.

There is a parallel LCS feature in 1.2 that is supposed to speed up LCS. Let
us know how it goes for you, since you are using LCS on 1.2.

Also there are a few JIRAs related to this issue:

https://issues.apache.org/jira/browse/CASSANDRA-2698
https://issues.apache.org/jira/browse/CASSANDRA-3721


Thanks.
-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Wednesday, March 6, 2013 8:29:16 AM
Subject: Re: should I file a bug report on this or is this normal?



15. Size of nreldata is now 220K ….it has exploded in size!! 
This may be explained by fragmentation in the sstables, which compaction would 
eventually resolve. 


During repair the data came from multiple nodes and created multiple sstables 
for each CF. Streaming copies part of an SSTable on the source and creates an 
SSTable on the destination. This pattern is different to all writes for a CF 
going to the same sstable when flushed. 


To compare apples to apples run a major compaction after the initial data load, 
and after the repair. 



1. Why is the bloomfilter for level 5 a total of 3856 bytes for 29118(large to 
small) bytes of data while in the initial data it was 2192 bytes for 
43038(small to large) bytes of data? 
The size of the BF depends on the number of rows and the false positive rate. 
Not the size of the -Data.db component on disk. 
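
To illustrate why only the row count and the false positive rate matter, here is the textbook sizing formula for an optimal bloom filter, m = -n ln(p) / (ln 2)^2. This is a sketch of the general relationship, not Cassandra's actual implementation, which rounds and buckets differently.

```python
import math


def bloom_filter_bits(rows: int, fp_rate: float) -> int:
    """Textbook optimal bloom filter size in bits for `rows` keys.

    Not Cassandra's exact sizing; it only shows which inputs drive the size.
    """
    if rows < 0 or not 0.0 < fp_rate < 1.0:
        raise ValueError("rows must be >= 0 and 0 < fp_rate < 1")
    return math.ceil(-rows * math.log(fp_rate) / (math.log(2) ** 2))


for rows, fp in ((1000, 0.01), (1000, 0.1), (2000, 0.01)):
    bits = bloom_filter_bits(rows, fp)
    print(f"rows={rows} fp={fp}: ~{bits} bits (~{bits // 8} bytes)")
```

Note that the size of the data on disk never appears in the formula: two SSTables with the same number of rows get the same filter size regardless of how large the rows are.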



2. Why is there 3 levels? With such a small set of data, I would think it would 
flush one data file like the original data but instead there is 3 files. 
See above. 


Cheers 








- 
Aaron Morton 
Freelance Cassandra Developer 
New Zealand 


@aaronmorton 
http://www.thelastpickle.com 


On 6/03/2013, at 6:40 AM, Hiller, Dean  dean.hil...@nrel.gov  wrote: 


I ran a pretty solid QA test(cleaned data from scratch) on version 1.2.2 

My test was as follows:

1. Start up 4 node cassandra cluster 
2. Populate with initial test data (no other data is added to system after this 
point!!!) 
3. Run nodetool drain on every node(move stuff from commit log to sstables) 
4. Stop and start cassandra cluster to have it running again 
5. Get size of nreldata CF folder is 128kB 
6. Go to node 3, run snapshot and mv snapshots directory OUT of nreldata 
7. Get size of nreldata CF folder is 128kB 
8. On node 3, run nodetool drain 
9. Get size of nreldataCF folder is still 128kB 
10. Stop cassandra node 
11. Rm keyspace/nreldata/*.db 
12. Size of nreldata CF is 8kB (odd for an empty folder, but OK) 
13. Start cassandra 
14. Nodetool repair databus5 nreldata 
15. Size of nreldata is now 220K ….it has exploded in size!! 

I ran this QA test because we see data size explosion in production as well (I
can't be 100% sure this is the same thing, though, as the above is such a
small data set). Would leveled compaction be a bit more stable in terms of
size ratios and such?

QUESTIONS 

1. Why is the bloomfilter for level 5 a total of 3856 bytes for 29118(large to 
small) bytes of data while in the initial data it was 2192 bytes for 
43038(small to large) bytes of data? 
2. Why are there 3 levels? With such a small set of data, I would think it
would flush one data file like the original data, but instead there are 3
files. 

My files after repair have levels 5, 6, and 7. My files before deletion of the 
CF have just level 1. After repair files are 
-rw-rw-r--. 1 cassandra cassandra 54 Mar 6 07:18 databus5-nreldata-ib-5-CompressionInfo.db
-rw-rw-r--. 1 cassandra cassandra 29118 Mar 6 07:18 databus5-nreldata-ib-5-Data.db
-rw-rw-r--. 1 cassandra cassandra 3856 Mar 6 07:18 databus5-nreldata-ib-5-Filter.db
-rw-rw-r--. 1 cassandra cassandra 37000 Mar 6 07:18 databus5-nreldata-ib-5-Index.db
-rw-rw-r--. 1 cassandra cassandra 4772 Mar 6 07:18 databus5-nreldata-ib-5-Statistics.db
-rw-rw-r--. 1 cassandra cassandra 383 Mar 6 07:18 databus5-nreldata-ib-5-Summary.db
-rw-rw-r--. 1 cassandra cassandra 79 Mar 6 07:18 databus5-nreldata-ib-5-TOC.txt
-rw-rw-r--. 1 cassandra cassandra 46 Mar 6 07:18 databus5-nreldata-ib-6-CompressionInfo.db
-rw-rw-r--. 1 cassandra cassandra 14271 Mar 6 07:18 databus5-nreldata-ib-6-Data.db
-rw-rw-r--. 1 cassandra cassandra 816 Mar 6 07:18 databus5-nreldata-ib-6-Filter.db
-rw-rw-r--. 1 cassandra cassandra 18248 Mar 6 07:18 databus5-nreldata-ib-6-Index.db
-rw-rw-r--. 1 cassandra cassandra 4756 Mar 6 07:18 databus5-nreldata-ib-6-Statistics.db
-rw-rw-r--. 1 cassandra cassandra 230 Mar 6 07:18 databus5-nreldata-ib-6-Summary.db
-rw-rw-r--. 1 cassandra cassandra 79 Mar 6 07:18 databus5-nreldata-ib-6-TOC.txt
-rw-rw-r--. 1 cassandra cassandra 46 Mar 6 07:18 databus5-nreldata-ib-7-CompressionInfo.db
-rw-rw-r--. 1 cassandra cassandra 14271 Mar 6 07:18

Re: Write latency spikes

2013-03-07 Thread Wei Zhu

If your SLA is tight, try setting socketTimeout in Hector to a small value so
that the client can retry sooner, assuming your writes are idempotent.
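
The retry pattern itself is client-agnostic and can be sketched as below. This is not Hector's API: write_with_retry and the write callable are hypothetical names, and the short timeout is assumed to be configured inside whatever the callable wraps. It is only safe when repeating the write has the same effect as doing it once.

```python
import time


def write_with_retry(write, attempts=3, backoff_s=0.0):
    """Call write() until it succeeds or the attempts run out.

    `write` is a hypothetical zero-argument callable that performs one
    idempotent write and raises TimeoutError when the socket timeout fires.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return write()
        except TimeoutError as exc:
            last_error = exc
            # Optional pause before the next try, skipped after the last one.
            if backoff_s and attempt < attempts - 1:
                time.sleep(backoff_s)
    raise last_error
```

With a timeout well below the SLA, a write that lands on a node in the middle of a spike is abandoned early and retried instead of waiting out the spike.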

Regarding your write latency, I don't have much insight. We see spikes on
reads due to GC, compaction, etc., but not on writes.

Thanks.
-Wei

- Original Message -
From: Jouni Hartikainen jouni.hartikai...@reaktor.fi
To: user@cassandra.apache.org
Sent: Wednesday, March 6, 2013 10:44:49 PM
Subject: Write latency spikes

Hi all,

I'm experiencing strange latency spikes when writing and trying to figure out 
what could cause them.

My setup:
- 3 nodes, writing at CL.ONE using Hector client, no reads
- Writing simultaneously to 3 CFs, inserts with 25h TTL, no deletes, no 
updates, RF 3
   - 2 CFs have small data (row count  2000, row size  500kB, column 
count/row  15 000)
   - 1 CF has lots of binary data split into ~60kB columns (row count  550 
000, row sizes  2MB, column count/row  40)
   - Write rate ~300 inserts / s for each CF, total write throughput ~25 MB 
(bytes) / second
   - data is time series using timestamp as column key
- Cassandra 1.2.2 with 256 vnodes on each machine
- Key cache at default 100MB, no row cache
- 1 x Xeon L5430 CPU, 16GB RAM, 2.3T disc on RAID10 (10k SAS), Sun/Oracle JDK 
1.6 (tried also 1.7), 4GB JVM heap, JNA enabled
- all nodes in the same DC, 1Gb network, sub ms latencies between nodes

cassandra.yaml: http://pastebin.com/MSr2prpb
cfstats: http://pastebin.com/Ax5vPUcY
example cfhistograms: http://pastebin.com/qYSL1MX3
example proxy histograms: http://pastebin.com/X3AGGEjh

With this setup I usually get quite nice write latencies of less than 20ms, but
sometimes (about once every few minutes) latencies momentarily spike to more
than 300ms, maxing out at ~2.5 seconds. Spikes are short (< 1 s) and happen on
all nodes (but not at the same time). Even if avg latencies are very good,
these spikes cause us headaches due to our SLA.

While investigating I have learned the following:
- No evident GC pressure (nothing in C* logs, GC logging showing constantly
< 30ms collection pauses)
- No I/O bounds (disks provide ~1GB/s linear write and are mostly idle apart 
from memtable flushes for every ~11s)
- No relation between spikes & compaction
- No queuing in memtable FlushWriter, no blocked memtable flushes
- Nothing alarming in logs
- No timeouts, no errors on the client side
- Each client (3 separate machines) experience latencies simultaneously which 
points to cause being in C*, not in the client
- CPU load < 10% (< 20% while compacting)
- Latencies measured both from the client and observed using nodetool 
cfhistograms

Now I'm running out of ideas about what might cause the spikes, as I have
understood that there are really not that many places on the write path that
could block.

Any ideas?

-Jouni

Re: Bloom filters and LCS

2013-03-07 Thread Wei Zhu

Where did you read that bloom filters are off for LCS on 1.1.9?

Those are the two issues I can find regarding this matter:

https://issues.apache.org/jira/browse/CASSANDRA-4876
https://issues.apache.org/jira/browse/CASSANDRA-5029

Looks like in 1.2 it defaults to 0.1; not sure about 1.1.X.

-Wei

- Original Message -
From: Michael Theroux mthero...@yahoo.com
To: user@cassandra.apache.org
Sent: Thursday, March 7, 2013 1:18:38 PM
Subject: Bloom filters and LCS

Hello,

(Hopefully) Quick question.

We are running Cassandra 1.1.9.

I recently converted some tables from Size Tiered to Leveled Compaction. The
amount of space for Bloom Filters on these tables went down tremendously (which
is expected; LCS in 1.1.9 does not use bloom filters).

However, although it's far less, it's still using a number of megabytes. Why is
it not zero?


Column Family: 
SSTable count: 526
Space used (live): 7251063348
Space used (total): 7251063348
Number of Keys (estimate): 23895552
Memtable Columns Count: 45719
Memtable Data Size: 21207173
Memtable Switch Count: 579
Read Count: 21773431
Read Latency: 4.155 ms.
Write Count: 16183367
Write Latency: 0.029 ms.
Pending Tasks: 0
Bloom Filter False Positives: 2442
Bloom Filter False Ratio: 0.00245
Bloom Filter Space Used: 44674656
Compacted row minimum size: 73
Compacted row maximum size: 105778
Compacted row mean size: 1104

Thanks,
-Mike

Re: Cassandra instead of memcached

2013-03-06 Thread Wei Zhu

It also depends on your SLA. It should work 99% of the time, but one
GC/flush/compaction could screw things up big time if you have a tight SLA.

-Wei

 From: Drew Kutcharian d...@venarc.com
To: user@cassandra.apache.org 
Sent: Wednesday, March 6, 2013 9:32 AM
Subject: Re: Cassandra instead of memcached

I think the dataset should fit in memory easily. The main purpose of this would
be as a store for an API rate limiting/accounting system. I think the eBay guys
are using C* for the same reason too. Initially we were thinking of using
Hazelcast or memcached. But Hazelcast (at least the community edition) has Java
GC issues with big heaps, and the problem with memcached is the lack of
reliable distribution (you lose a node, you need to rehash everything), so I
figured why not just use C*.

On Mar 6, 2013, at 9:08 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

If you're writing much more data than RAM, Cassandra will not work as fast as
memcache. Cassandra is not magical: if all of your data fits in memory it is
going to be fast, and if most of your data fits in memory it can still be fast.
However, if you plan on having much more data than disk you need to think about
more RAM and/or SSD disks.

We do not use c* as an in-memory store. However for many of our datasets we 
do not have a separate caching tier. In those cases cassandra is both our 
database and our in-memory store if you want to use those terms :)

On Wed, Mar 6, 2013 at 12:02 PM, Drew Kutcharian d...@venarc.com wrote:

Thanks guys, this is what I was looking for.

@Edward. I definitely like crazy ideas ;), I think the only issue here is
that C* is a disk space hog, so not sure if that would be feasible since free
RAM is not as abundant as disk. BTW, I watched your presentation; are you
guys still using C* as an in-memory store?

On Mar 6, 2013, at 7:44 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

http://www.slideshare.net/edwardcapriolo/cassandra-as-memcache

Read at ONE.
READ_REPAIR_CHANCE as low as possible.

Use short TTL and short GC_GRACE.

Make the in memory memtable size as high as possible to avoid flushing and 
compacting.

Optionally turn off commit log.

You can use cassandra like memcache but it is not a memcache replacement. 
Cassandra persists writes and compacts SSTables, memcache only has to keep 
data in memory.

If you want to try a crazy idea, try putting your persistent data on a RAM
disk! Not data/system, however!

On Wed, Mar 6, 2013 at 2:45 AM, aaron morton aa...@thelastpickle.com wrote:

Consider disabling durable_writes in the KS config to remove writing to the
commit log. That will speed things up for you. Note that you risk losing
data if Cassandra crashes or is not shut down with nodetool drain.

Even if you set the gc_grace to 0, deletes will still need to be committed 
to disk. 

Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com/

On 5/03/2013, at 9:51 AM, Drew Kutcharian d...@venarc.com wrote:

Thanks Ben, that article was actually the reason I started thinking about 
removing memcached.

I wanted to see what would be the optimum config to use C* as an in-memory 
store.

-- Drew

On Mar 5, 2013, at 2:39 AM, Ben Bromhead b...@instaclustr.com wrote:

Check out 
http://techblog.netflix.com/2012/07/benchmarking-high-performance-io-with.html

Netflix used Cassandra with SSDs and were able to drop their memcache 
layer. Mind you they were not using it purely as an in memory KV store.

Ben
Instaclustr | www.instaclustr.com | @instaclustr

On 05/03/2013, at 4:33 PM, Drew Kutcharian d...@venarc.com wrote:

Hi Guys,

I'm thinking about using Cassandra as an in-memory key/value store 
instead of memcached for a new project (just to get rid of a dependency 
if possible). I was thinking about setting the replication factor to 1, 
enabling off-heap row-cache and setting gc_grace_period to zero for the 
CF that will be used for the key/value store.

Has anyone tried this? Any comments?

Thanks,

Drew

Re: Poor read latency

2013-03-05 Thread Wei Zhu

According to this:

https://issues.apache.org/jira/browse/CASSANDRA-5029


Bloom filter is still on by default for LCS in 1.2.X

Thanks.
-Wei



 From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org user@cassandra.apache.org 
Sent: Monday, March 4, 2013 10:42 AM
Subject: Re: Poor read latency
 
The recommended setting is 8G of RAM, and your memory use grows with the number
of rows through index samples (configured in cassandra.yaml as samples per row
something…look for the word index). Also, bloom filter memory grows if you are
using size tiered compaction. We are actually trying to switch to leveled
compaction in 1.2.2, as I think the default there is no bloomfilters, since
LCS does not really need them, I think, because 90% of rows are in the highest
tier (but this only works better for certain profiles, like very heavy reads
relative to the number of writes).

Later,
Dean

From: Tom Martin tompo...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Monday, March 4, 2013 11:20 AM
To: user@cassandra.apache.org
Subject: Re: Poor read latency

Yeah, I just checked and the heap size 0.75 warning has been appearing.

nodetool info reports:

Heap Memory (MB) : 563.88 / 1014.00
Heap Memory (MB) : 646.01 / 1014.00
Heap Memory (MB) : 639.71 / 1014.00

We have plenty of free memory on each instance.  Do we need bigger instances or 
should we just configure each node to have a bigger max heap?
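
For reference, the three readings above can be compared against the 0.75 mark from the warning. A minimal sketch, assuming the line format shown in this message; a single nodetool info reading is only a snapshot, so a node can still cross the threshold between samples.

```python
import re

# Matches lines such as "Heap Memory (MB) : 563.88 / 1014.00".
HEAP_LINE = re.compile(r"Heap Memory \(MB\)\s*:\s*([\d.]+)\s*/\s*([\d.]+)")


def heap_usage_ratio(line: str) -> float:
    """Return used/max heap from one nodetool info heap line."""
    match = HEAP_LINE.search(line)
    if match is None:
        raise ValueError(f"not a heap line: {line!r}")
    used, maximum = float(match.group(1)), float(match.group(2))
    return used / maximum


samples = [
    "Heap Memory (MB) : 563.88 / 1014.00",
    "Heap Memory (MB) : 646.01 / 1014.00",
    "Heap Memory (MB) : 639.71 / 1014.00",
]
for line in samples:
    ratio = heap_usage_ratio(line)
    status = "at or above" if ratio >= 0.75 else "below"
    print(f"{ratio:.2f} ({status} 0.75)")
```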


On Mon, Mar 4, 2013 at 6:10 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
What does nodetool info say for your memory? (We hit that one with memory near
the max and it slowed down our system big time…still working on resolving it
too.)

Do any logs have the hit 0.75, running compaction message OR, worse, the hit
0.85, running compaction one? You typically get that if the above is the case.

Dean

From: Tom Martin tompo...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Monday, March 4, 2013 10:31 AM
To: user@cassandra.apache.org
Subject: Poor read latency

Hi all,

We have a small (3 node) cassandra cluster on aws.  We have a replication 
factor of 3, a read level of local_quorum and are using the ephemeral disk.  
We're getting pretty poor read performance and quite high read latency in 
cfstats.  For example:

Column Family: AgentHotel
SSTable count: 4
Space used (live): 829021175
Space used (total): 829021175
Number of Keys (estimate): 2148352
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 67204
Read Latency: 23.813 ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 50
Bloom Filter False Ratio: 0.00201
Bloom Filter Space Used: 7635472
Compacted row minimum size: 259
Compacted row maximum size: 4768
Compacted row mean size: 873

For comparison we have a similar set up in another cluster for an old project 
(hosted on rackspace) where we're getting sub 1ms read latencies.  We are using 
multigets on the client (Hector) but are only requesting ~40 rows per request 
on average.

I feel like we should reasonably expect better performance but perhaps I'm 
mistaken.  Is there anything super obvious we should be checking out?

Re: what size file for LCS is best for 300-500G per node?

2013-03-04 Thread Wei Zhu

We have 200G and ended up going with 10M. The compaction after repair takes a
day to finish. Try running a repair and see how it goes.

-Wei

- Original Message -
From: Dean Hiller dean.hil...@nrel.gov
To: user@cassandra.apache.org
Sent: Monday, March 4, 2013 10:52:27 AM
Subject: what size file for LCS is best for 300-500G per node?

Should we really be going with 5MB when it compresses to 3MB? That seems to be
on the small side, right? We have ulimit cranked up, so many files shouldn't
be an issue, but maybe we should go to 10MB or 100MB or something in between?
Does anyone have any experience with changing the LCS sizes?

I did read somewhere that startup times when opening 100,000 files could be
slow, which implies that a larger size, and so fewer files, might be better?

Thanks,
Dean

Re: Mutation dropped

2013-02-21 Thread Wei Zhu

Thanks Aaron for the great information as always. I just checked cfhistograms
and only a handful of read latencies are above 100ms, but in proxyhistograms
there are 10 times as many above 100ms. We are using QUORUM for reads with
RF=3, and I understand the coordinator needs to get the digest from other
nodes, do read repair on a mismatch, etc. But is it normal to see latency in
proxyhistograms go beyond 100ms? Is there any way to improve that?
We are tracking metrics from the client side, and we see the 95th percentile
response time averaging 40ms, which is a bit high. Our 50th percentile is
great, under 3ms.
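
For anyone comparing numbers like these, here is one common way to compute percentiles from raw client-side latency samples (the nearest-rank method). It is only a sketch; the metrics library behind the figures above may use a different estimator, and the sample values are made up.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


latencies_ms = [2, 3, 3, 40, 120]  # made-up values, for illustration only
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))
```

A low median with a high 95th percentile, as in this made-up sample, means a few slow requests dominate the tail while most requests are fast.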

Any suggestion is very much appreciated.

Thanks.
-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: Cassandra User user@cassandra.apache.org
Sent: Thursday, February 21, 2013 9:20:49 AM
Subject: Re: Mutation dropped

 What does rpc_timeout control? Only the reads/writes? 
Yes. 

 like data stream,
streaming_socket_timeout_in_ms in the yaml

 merkle tree request? 
Either no time out or a number of days, cannot remember which right now. 

 What is the side effect if it's set to a really small number, say 20ms?
You will probably get a lot more requests that fail with a TimedOutException. 

rpc_timeout needs to be longer than the time it takes a node to process the 
message, and the time it takes the coordinator to do its thing. You can look 
at cfhistograms and proxyhistograms to get a better idea of how long a request 
takes in your system.  
  
Cheers

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 6:56 AM, Wei Zhu wz1...@yahoo.com wrote:

 What does rpc_timeout control? Only the reads/writes? How about other 
 inter-node communication, like data stream, merkle tree request?  What is the 
 reasonable value for rpc_timeout? The default value of 10 seconds is way too 
 long. What is the side effect if it's set to a really small number, say 20ms?
 
 Thanks.
 -Wei
 
 From: aaron morton aa...@thelastpickle.com
 To: user@cassandra.apache.org 
 Sent: Tuesday, February 19, 2013 7:32 PM
 Subject: Re: Mutation dropped
 
 Does the rpc_timeout not control the client timeout ?
 No it is how long a node will wait for a response from other nodes before 
 raising a TimedOutException if less than CL nodes have responded. 
 Set the client side socket timeout using your preferred client. 
 
 Is there any param which is configurable to control the replication timeout 
 between nodes ?
 There is no such thing.
 rpc_timeout is roughly like that, but it's not right to think about it that 
 way. 
 i.e. if a message to a replica times out and CL nodes have already responded 
 then we are happy to call the request complete. 
 
 Cheers
 
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
 
 @aaronmorton
 http://www.thelastpickle.com
 
 On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote:
 
 Thanks Aaron.
  
 Does the rpc_timeout not control the client timeout ? Is there any param 
 which is configurable to control the replication timeout between nodes ? Or 
 the same param is used to control that since the other node is also like a 
 client ?
  
  
  
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: 17 February 2013 11:26
 To: user@cassandra.apache.org
 Subject: Re: Mutation dropped
  
 You are hitting the maximum throughput on the cluster. 
  
 The messages are dropped because the node fails to start processing them 
 before rpc_timeout. 
  
 However the request is still a success because the client requested CL was 
 achieved. 
  
 Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
 Both nodes replicate each row, and writes are sent to each replica, so the 
 only thing the client is waiting on is the local node to write to its 
 commit log. 
  
 Testing with (and running in prod) RF3 and CL QUORUM is a more real world 
 scenario. 
  
 Cheers
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand
  
 @aaronmorton
 http://www.thelastpickle.com
  
 On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote:
 
 
 Hi – Is there a parameter which can be tuned to prevent the mutations from 
 being dropped ? Is this logic correct ?
  
 Node A and B with RF=2, CL =1. Load balanced between the two.
  
 --  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
 UN  10.x.x.x   746.78 GB  256     100.0%            dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
 UN  10.x.x.x   880.77 GB  256     100.0%            95d59054-be99-455f-90d1-f43981d3d778  rack1
  
 Once we hit a very high TPS (around 50k/sec of inserts), the nodes start 
 falling behind and we see the mutation dropped messages. But there are no 
 failures on the client. Does that mean other node is not able to persist the 
 replicated data ? Is there some

Re: Mutation dropped

2013-02-20 Thread Wei Zhu

What does rpc_timeout control? Only the reads/writes? How about other 
inter-node communication, like data stream, merkle tree request?  What is the 
reasonable value for rpc_timeout? The default value of 10 seconds is way too 
long. What is the side effect if it's set to a really small number, say 20ms?

Thanks.
-Wei



 From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org 
Sent: Tuesday, February 19, 2013 7:32 PM
Subject: Re: Mutation dropped
 

Does the rpc_timeout not control the client timeout ?
No it is how long a node will wait for a response from other nodes before 
raising a TimedOutException if less than CL nodes have responded. 
Set the client side socket timeout using your preferred client. 

Is there any param which is configurable to control the replication timeout 
between nodes ?
There is no such thing.
rpc_timeout is roughly like that, but it's not right to think about it that 
way. 
i.e. if a message to a replica times out and CL nodes have already responded 
then we are happy to call the request complete. 
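
The rule described above can be sketched in a few lines. This is only an illustration of the idea, not Cassandra's actual code: the function name, its parameters and the sample latencies are all made up for the example.

```python
def coordinator_outcome(replica_latencies_ms, required_acks, rpc_timeout_ms):
    """Return 'success' if enough replicas answer within the timeout, else 'timed_out'.

    replica_latencies_ms: one response time per replica; None means no reply at all.
    """
    on_time = [
        latency for latency in replica_latencies_ms
        if latency is not None and latency <= rpc_timeout_ms
    ]
    return "success" if len(on_time) >= required_acks else "timed_out"


# RF=3 at QUORUM (2 acks): one very slow replica does not fail the request.
print(coordinator_outcome([4, 9, 15000], required_acks=2, rpc_timeout_ms=10000))
# Same latencies with a 5 ms timeout: only one replica answers in time.
print(coordinator_outcome([4, 9, 15000], required_acks=2, rpc_timeout_ms=5))
```

With a very small timeout the second case becomes the common one, which is why a 20ms setting would be expected to produce many more TimedOutExceptions.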

Cheers

 

-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/02/2013, at 1:48 AM, Kanwar Sangha kan...@mavenir.com wrote:

Thanks Aaron.
 
Does the rpc_timeout not control the client timeout ? Is there any param which 
is configurable to control the replication timeout between nodes ? Or the same 
param is used to control that since the other node is also like a client ?
 
 
 
From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: 17 February 2013 11:26
To: user@cassandra.apache.org
Subject: Re: Mutation dropped
 
You are hitting the maximum throughput on the cluster. 
 
The messages are dropped because the node fails to start processing them 
before rpc_timeout. 
 
However the request is still a success because the client requested CL was 
achieved. 
 
Testing with RF 2 and CL 1 really just tests the disks on one local machine. 
Both nodes replicate each row, and writes are sent to each replica, so the 
only thing the client is waiting on is the local node to write to its commit 
log. 
 
Testing with (and running in prod) RF3 and CL QUORUM is a more real world 
scenario. 
 
Cheers
 
-
Aaron Morton
Freelance Cassandra Developer
New Zealand
 
@aaronmorton
http://www.thelastpickle.com
 
On 15/02/2013, at 9:42 AM, Kanwar Sangha kan...@mavenir.com wrote:



Hi – Is there a parameter which can be tuned to prevent the mutations from 
being dropped ? Is this logic correct ?
 
Node A and B with RF=2, CL =1. Load balanced between the two.
 
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  10.x.x.x   746.78 GB  256     100.0%            dbc9e539-f735-4b0b-8067-b97a85522a1a  rack1
UN  10.x.x.x   880.77 GB  256     100.0%            95d59054-be99-455f-90d1-f43981d3d778  rack1
 
Once we hit a very high TPS (around 50k/sec of inserts), the nodes start 
falling behind and we see the mutation dropped messages. But there are no 
failures on the client. Does that mean other node is not able to persist the 
replicated data ? Is there some timeout associated with replicated data 
persistence ?
 
Thanks,
Kanwar
 
 
 
 
 
 
 
From: Kanwar Sangha [mailto:kan...@mavenir.com] 
Sent: 14 February 2013 09:08
To: user@cassandra.apache.org
Subject: Mutation dropped
 
Hi – I am doing a load test using YCSB across 2 nodes in a cluster and seeing 
a lot of mutation dropped messages.  I understand that this is due to the 
replica not being written to the
other node ? RF = 2, CL =1.
 
From the wiki -
For MUTATION messages this means that the mutation was not applied to all 
replicas it was sent to. The inconsistency will be repaired by Read Repair or 
Anti Entropy Repair
 
Thanks,
Kanwar

Re: cassandra vs. mongodb quick question(good additional info)

2013-02-19 Thread Wei Zhu

From my limited experience with Mongo, it seems that Mongo only performs when 
the whole data set is in the memory which makes me wonder how the 40TB data 
works..

- Original Message -
From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org
Sent: Tuesday, February 19, 2013 7:02:56 AM
Subject: Re: cassandra vs. mongodb quick question(good additional info)

The 40 TB use case you heard about is probably one 40TB mysql machine
that someone migrated to mongo so it would be web scale. Cassandra is
NOT good with drives that big, get a blade center or a high density
chassis.

On Mon, Feb 18, 2013 at 8:00 PM, Hiller, Dean dean.hil...@nrel.gov wrote:
 I thought about this more, and even with a 10Gbit network, it would take 40 
 days to bring up a replacement node if mongodb did truly have a 42T / node 
 like I had heard.  I wrote the below email to the person I heard this from 
 going back to basics which really puts some perspective on it….(and a lot of 
 people don't even have a 10Gbit network like we do)

 Nodes are hooked up by a 10G network at most right now where that is 
 10gigabit.  We are talking about 10Terabytes on disk per node recently.

 Google 10 gigabit in gigabytes gives me 1.25 gigabytes/second  (yes I could 
 have divided by 8 in my head but eh…course when I saw the number, I went duh)

 So trying to transfer 10 Terabytes  or 10,000 Gigabytes to a node that we are 
 bringing online to replace a dead node would take approximately 5 days???

 This means no one else is using the bandwidth too ;).  10,000Gigabytes * 1 
 second/1.25 * 1hr/60secs * 1 day / 24 hrs = 5.55 days.  This is more 
 likely 11 days if we only use 50% of the network.

 So bringing a new node up to speed is more like 11 days once it is crashed.  
 I think this is the main reason the 1Terabyte exists to begin with, right?

 From an ops perspective, this could sound like a nightmare scenario of 
 waiting 10 days…..maybe it is livable though.  Either way, I thought it would 
 be good to share the numbers.  ALSO, that is assuming the bus with its 10 
 disks can keep up with 10G.  Can it?  What is the limit of throughput on a 
 bus / second on the computers we have as on wikipedia there is a huge 
 variance?

 What is the rate of the disks too (multiplied by 10 of course)?  Will they 
 keep up with a 10G rate for bringing a new node online?

 This all comes into play even more so when you want to double the size of 
 your cluster of course as all nodes have to transfer half of what they have 
 to all the new nodes that come online(cassandra actually has a very data 
 center/rack aware topology to transfer data correctly to not use up all 
 bandwidth unnecessarily…I am not sure mongodb has that).  Anyways, just food 
 for thought.
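
The transfer-time estimate above can be re-run as a quick script. This is a rough sketch that assumes 3600 seconds per hour and that the network link is the only bottleneck; `transfer_hours` and its `utilisation` parameter are illustrative names, not something from the thread.

```python
def transfer_hours(data_gb, link_gbit_per_s, utilisation=1.0):
    """Hours needed to move data_gb gigabytes over a link at the given utilisation."""
    gb_per_second = link_gbit_per_s / 8 * utilisation  # 10 Gbit/s -> 1.25 GB/s
    seconds = data_gb / gb_per_second
    return seconds / 3600  # 3600 seconds per hour


for label, hours in [
    ("10 TB at 10 Gbit/s, full link", transfer_hours(10_000, 10)),
    ("10 TB at 10 Gbit/s, 50% of link", transfer_hours(10_000, 10, 0.5)),
    ("10 TB at 1 Gbit/s, full link", transfer_hours(10_000, 1)),
]:
    print(f"{label}: {hours:.1f} hours")
```

Under those assumptions the ideal figures come out in hours, so multi-day rebuild times would reflect other limits (disk throughput, CPU, streaming throttles) rather than raw link speed.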

 From: aaron morton aa...@thelastpickle.com
 Reply-To: user@cassandra.apache.org
 Date: Monday, February 18, 2013 1:39 PM
 To: user@cassandra.apache.org, Vegard Berget p...@fantasista.no
 Subject: Re: cassandra vs. mongodb quick question

 My experience is repair of 300GB compressed data takes longer than 300GB of 
 uncompressed, but I cannot point to an exact number. Calculating the 
 differences is mostly CPU bound and works on the non compressed data.

 Streaming uses compression (after uncompressing the on disk data).

 So if you have 300GB of compressed data, take a look at how long repair takes 
 and see if you are comfortable with that. You may also want to test replacing 
 a node so you can get the procedure documented and understand how long it 
 takes.

 The idea of the soft 300GB to 500GB limit came about because of a number of 
 cases where people had 1 TB on a single node and they were surprised it took 
 days to repair or replace. If you know how long things may take, and that 
 fits in your operations then go with it.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 18/02/2013, at 10:08 PM, Vegard Berget p...@fantasista.no wrote:

 Just out of curiosity :

 When using compression, does this affect this one way or another?  Is 300G 
 (compressed) SSTable size, or total size of data?

 .vegard,

 - Original Message -
 From: user@cassandra.apache.org
 To: user@cassandra.apache.org
 Sent: Mon, 18 Feb 2013 08:41:25 +1300
 Subject: Re: cassandra vs. mongodb quick question

 If you have spinning disk and 1G networking and no virtual nodes, I would 
 still say 300G to 500G is a soft limit.

 If you are using virtual nodes, SSD, JBOD disk configuration or faster 
 networking you may go higher.

 The limiting factors are the time it take

Re: Long running nodetool repair

2013-02-19 Thread Wei Zhu

It should not take that long. For my 200G node, it takes about an hour to 
calculate the Merkle tree and then data streaming. 

By the way, how do you know the repair is not done?

If you run nodetool tpstats, it should give you the  AntiEntropy session info, 
active/pending/completed etc. While calculating Merkle tree, you can see the 
progress from nodetool compactionstats. While streaming data, you can see the 
progress from nodetool netstats.

Also you can grep the log for Merkle and repair.



- Original Message -
From: Haithem Jarraya haithem.jarr...@struq.com
To: user@cassandra.apache.org
Sent: Tuesday, February 19, 2013 1:29:19 AM
Subject: Long running nodetool repair


Hi, 


I am new to Cassandra and I am not sure if this is the normal behavior but 
nodetool repair runs for too long even for small dataset per node. As I am 
writing I started a nodetool repair last night at 18:41 and now it's 9:18 and 
it's still running, the size of my data is only ~500mb per node. 
We have 
3 Node cluster in DC1 with RF 3 
1 Node Cluster in DC2 with RF 1 
1 Node cluster in DC3 with RF 1 


and running Cassandra V1.2.1 with 256 vNodes. 


From cassandra logs I do not see AntiEntropy logs anymore only compaction Task 
and FlushWriter. 


Is this a normal behaviour of nodetool repair? 
Does the running time grow linearly with the size of the data? 


Any help or direction will be much appreciated. 




Thanks, 


H

Re: Size Tiered - Leveled Compaction

2013-02-17 Thread Wei Zhu

We doubled the SSTable size to 10M. It still generates a lot of SSTables and we 
don't see much difference in the read latency.  We are able to finish the 
compactions after repair within several hours. We will increase the SSTable 
size again if we feel the number of SSTables hurts the performance. 

- Original Message -
From: Mike mthero...@yahoo.com
To: user@cassandra.apache.org
Sent: Sunday, February 17, 2013 4:50:40 AM
Subject: Re: Size Tiered - Leveled Compaction

Hello Wei, 

First thanks for this response. 

Out of curiosity, what SSTable size did you choose for your usecase, and what 
made you decide on that number? 

Thanks, 
-Mike 

On 2/14/2013 3:51 PM, Wei Zhu wrote: 

I haven't tried to switch compaction strategy. We started with LCS. 

For us, after massive data imports (5000 writes/second for 6 days), the first 
repair is painful since there is quite a lot of data inconsistency. For 150G nodes, 
repair brought in about 30G and created thousands of pending compactions. It 
took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. 
System performance degrades during that time since reads could go to more 
SSTables; we see 20 SSTable lookups for one read. (We tried everything we could 
and couldn't speed it up. I think it's single threaded and it's not 
recommended to turn on multithreaded compaction. We even tried that; it didn't 
help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. 
Haven't upgraded yet, hope it works :) 

http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 

Since our cluster is not write intensive, only 100 writes/second, I don't see any 
pending compactions during regular operation. 

One thing worth mentioning is the size of the SSTable: the default is 5M, which is 
kind of small for a 200G (all in one CF) data set, and we are on SSD. It is more 
than 150K files in one directory (200G/5M = 40K SSTables and each SSTable 
creates 4 files on disk). You might want to watch that and decide the SSTable 
size. 

By the way, there is no concept of Major compaction for LCS. Just for fun, you 
can look at a file called $CFName.json in your data directory and it tells you 
the SSTable distribution among different levels. 

-Wei 

From: Charles Brophy cbro...@zulily.com 
To: user@cassandra.apache.org 
Sent: Thursday, February 14, 2013 8:29 AM 
Subject: Re: Size Tiered - Leveled Compaction 

I second these questions: we've been looking into changing some of our CFs to 
use leveled compaction as well. If anybody here has the wisdom to answer them 
it would be of wonderful help. 

Thanks 
Charles 

On Wed, Feb 13, 2013 at 7:50 AM, Mike  mthero...@yahoo.com  wrote: 

Hello, 

I'm investigating the transition of some of our column families from Size 
Tiered - Leveled Compaction. I believe we have some high-read-load column 
families that would benefit tremendously. 

I've stood up a test DB Node to investigate the transition. I successfully 
alter the column family, and I immediately noticed a large number (1000+) 
pending compaction tasks become available, but no compaction get executed. 

I tried running nodetool sstableupgrade on the column family, and the 
compaction tasks don't move. 

I also notice no changes to the size and distribution of the existing SSTables. 

I then run a major compaction on the column family. All pending compaction 
tasks get run, and the SSTables have a distribution that I would expect from 
LeveledCompaction (lots and lots of 10MB files). 

Couple of questions: 

1) Is a major compaction required to transition from size-tiered to leveled 
compaction? 
2) Are major compactions as much of a concern for LeveledCompaction as they 
are for Size Tiered? 

All the documentation I found concerning transitioning from Size Tiered to 
Level compaction discuss the alter table cql command, but I haven't found too 
much on what else needs to be done after the schema change. 

I did these tests with Cassandra 1.1.9. 

Thanks, 
-Mike

Re: heap usage

2013-02-15 Thread Wei Zhu

We have 250G of data and are running with an 8GB heap, and one of the nodes went 
OOM during repair. 

I checked the bloom filter, only 200M. Not sure how the memory is used; maybe take a 
memory dump and examine that. 

- Original Message -
From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org
Sent: Friday, February 15, 2013 8:16:23 AM
Subject: Re: heap usage

It is not going to be true for long that LCS does not require bloom filters.

https://issues.apache.org/jira/browse/CASSANDRA-5029

Apparently, without bloom filters there are issues.

On Fri, Feb 15, 2013 at 7:29 AM, Blake Manders bl...@crosspixel.net wrote:

 You probably want to look at your bloom filters.  Be forewarned though,
 they're difficult to change; changes to bloom filter settings only apply to
 new SSTables, so they might not be noticeable until a few compactions have
 taken place.

 If that is your issue, and your usage model fits it, a good alternative to
 the slow propagation of higher miss rates is to switch to LCS (which doesn't
 use bloom filters), which won't require you to make the jump to 1.2.

 On Fri, Feb 15, 2013 at 4:06 AM, Reik Schatz reik.sch...@gmail.com wrote:

 Hi,

 recently we are hitting some OOM: Java heap space, so I was investigating
 how the heap is used in Cassandra 1.2+

 We use the calculated 4G heap. Our cluster is 6 nodes, around 750 GB data
 and a replication factor of 3. Row cache is disabled. All key cache and
 memtable settings are left at default.

 Is the primary key index kept in heap memory? We have a bunch of keyspaces
 and column families.

 Thanks,
 Rik

 --

 Blake Manders | CTO

 Cross Pixel, Inc. | 494 8th Ave, Penthouse | NYC 10001

 Website: crosspixel.net
 Twitter: twitter.com/CrossPix

Re: Cluster not accepting insert while one node is down

2013-02-14 Thread Wei Zhu

From the exception, looks like astyanax didn't even try to call Cassandra. My 
guess would be astyanax is token aware, it detects the node is down and it 
doesn't even try. If you use Hector, it might try to write since it's not 
token aware. But as Bryan said, it eventually will fail. I guess hinted hand 
off won't help since the write doesn't satisfy CL.ONE.

 From: Bryan Talbot btal...@aeriagames.com
To: user@cassandra.apache.org 
Sent: Thursday, February 14, 2013 8:30 AM
Subject: Re: Cluster not accepting insert while one node is down

Generally data isn't written to whatever node the client connects to.  In your 
case, a row is written to one of the nodes based on the hash of the row key.  
If that one replica node is down, it won't matter which coordinator node you 
attempt a write with CL.ONE: the write will fail.

If you want the write to succeed, you could do any one of: write with CL.ANY, 
increase RF to 2+, write using a row key that hashes to an UP node.

-Bryan

On Thu, Feb 14, 2013 at 2:06 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

I will let committers or anyone that has knowledge of Cassandra internals answer 
this.

From what I understand, you should be able to insert data on any up node with 
your configuration...

Alain

2013/2/14 Traian Fratean traian.frat...@gmail.com

You're right as regarding data availability on that node. And my config, being 
the default one, is not suited for a cluster.
What I don't get is that my 67 node was down and I was trying to insert in 66 
node, as can be seen from the stacktrace. Long story short: when node 67 was 
down I could not insert into any machine in the cluster. Not what I was 
expecting.

Thank you for the reply!
Traian.

2013/2/14 Alain RODRIGUEZ arodr...@gmail.com

Hi Traian,

There is your problem. You are using RF=1, meaning that each node is 
responsible for its range, and nothing more. So when a node goes down, do 
the math, you just can't read 1/5 of your data.

This is very cool for performance since each node owns its own part of the 
data and any write or read needs to reach only one node, but it makes every 
node a SPOF for its data, and avoiding that is a main point of using C*. So you 
have poor availability and poor consistency.

A usual configuration with 5 nodes would be RF=3 and both CL (RW) = QUORUM.

This will replicate your data to 2 nodes + the natural endpoints (total of 
3/5 nodes owning any data) and any read or write would need to reach at 
least 2 nodes before being considered as being successful ensuring a strong 
consistency.

This configuration allows you to shut down a node (crash or configuration 
update/rolling restart) without degrading the service (at least allowing you 
to reach any data) but at cost of more data on each node.
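
The availability argument above comes down to simple arithmetic. A minimal sketch follows; the helper names `quorum` and `can_serve` are invented for illustration and are not part of any Cassandra client.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def can_serve(replication_factor, required_acks, replicas_down):
    """True if enough replicas of a row are still up to satisfy the consistency level."""
    return replication_factor - replicas_down >= required_acks


# RF=1, CL.ONE: losing the single replica makes that token range unavailable.
print(can_serve(1, 1, replicas_down=1))          # False
# RF=3, QUORUM (2 of 3): one replica down is tolerated, two are not.
print(can_serve(3, quorum(3), replicas_down=1))  # True
print(can_serve(3, quorum(3), replicas_down=2))  # False
```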

Alain

2013/2/14 Traian Fratean traian.frat...@gmail.com

I am using defaults for both RF and CL. As the keyspace was created using 
cassandra-cli the default RF should be 1 as I get it from below:

[default@TestSpace] describe;
Keyspace: TestSpace:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
    Options: [datacenter1:1]

As for the CL, it is the Astyanax default, which is 1 for both reads and writes.

Traian.

2013/2/13 Alain RODRIGUEZ arodr...@gmail.com

We probably need more info like the RF of your cluster and CL of your reads 
and writes. Maybe could you also tell us if you use vnodes or not.

I heard that Astyanax was not running very smoothly on 1.2.0, but a bit 
better on 1.2.1. Yet, Netflix didn't release a version of Astyanax for 
C*1.2.

Alain

2013/2/13 Traian Fratean traian.frat...@gmail.com

Hi,

I have a cluster of 5 nodes running Cassandra 1.2.0 . I have a Java 
client with Astyanax 1.56.21.
When a node (10.60.15.67 - different from the one in the stacktrace below) 
went down I get TokenRangeOfflineException and no other data gets 
inserted into any other node from the cluster.

Am I having a configuration issue or this is supposed to happen?

com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor.trackError(CountingConnectionPoolMonitor.java:81) - com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160, latency=2057(2057), attempts=1]UnavailableException()
com.netflix.astyanax.connectionpool.exceptions.TokenRangeOfflineException: TokenRangeOfflineException: [host=10.60.15.66(10.60.15.66):9160, latency=2057(2057), attempts=1]UnavailableException()
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:165)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:60)
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:27)
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$1.execute(ThriftSyncConnectionFactoryImpl.java:140)
at

Re: Size Tiered - Leveled Compaction

2013-02-14 Thread Wei Zhu

I haven't tried to switch compaction strategy. We started with LCS. 

For us, after massive data imports (5000 writes/second for 6 days), the first 
repair is painful since there is quite a lot of data inconsistency. For 150G nodes, 
repair brought in about 30G and created thousands of pending compactions. It 
took almost a day to clear those. Just be prepared: LCS is really slow in 1.1.X. 
System performance degrades during that time since reads could go to more 
SSTables; we see 20 SSTable lookups for one read. (We tried everything we could 
and couldn't speed it up. I think it's single threaded and it's not 
recommended to turn on multithreaded compaction. We even tried that; it didn't 
help.) There is parallel LCS in 1.2 which is supposed to alleviate the pain. 
Haven't upgraded yet, hope it works :)

http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2



Since our cluster is not write intensive, only 100 writes/second, I don't see any 
pending compactions during regular operation. 

One thing worth mentioning is the size of the SSTable: the default is 5M, which is 
kind of small for a 200G (all in one CF) data set, and we are on SSD. It is more 
than 150K files in one directory (200G/5M = 40K SSTables and each SSTable 
creates 4 files on disk). You might want to watch that and decide the SSTable 
size. 
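
The file-count estimate above is easy to redo for other SSTable sizes. The sketch below assumes 1G = 1024M and takes the "4 files per SSTable" figure from this message; the function name is made up for the example.

```python
def sstable_file_count(data_gb, sstable_mb, files_per_sstable=4):
    """Approximate (SSTable count, on-disk file count) for a given SSTable size."""
    sstables = data_gb * 1024 // sstable_mb
    return sstables, sstables * files_per_sstable


for size_mb in (5, 10, 100):
    sstables, files = sstable_file_count(200, size_mb)
    print(f"{size_mb:>3} MB SSTables: ~{sstables:,} SSTables, ~{files:,} files")
```

For 200G this gives roughly 41K SSTables and 164K files at 5M, and halves each time the SSTable size doubles.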

By the way, there is no concept of Major compaction for LCS. Just for fun, you 
can look at a file called $CFName.json in your data directory and it tells you 
the SSTable distribution among different levels. 
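
For a rough idea of what that distribution should look like, here is a small sketch. It assumes the commonly described LCS layout in which each level is ten times the size of the previous one; the fan-out value and the function name are assumptions for illustration, not taken from this thread.

```python
def level_capacity_mb(level, sstable_mb, fanout=10):
    """Target size of a level, assuming each level is fanout times the previous one."""
    return sstable_mb * fanout ** level


for level in range(1, 6):
    capacity = level_capacity_mb(level, sstable_mb=5)
    print(f"L{level}: up to ~{capacity:,} MB ({capacity // 5:,} SSTables of 5 MB)")
```

Under that assumption, a 200G data set with 5M SSTables would reach into level 5.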

-Wei



 From: Charles Brophy cbro...@zulily.com
To: user@cassandra.apache.org 
Sent: Thursday, February 14, 2013 8:29 AM
Subject: Re: Size Tiered - Leveled Compaction
 

I second these questions: we've been looking into changing some of our CFs to 
use leveled compaction as well. If anybody here has the wisdom to answer them 
it would be of wonderful help.

Thanks
Charles


On Wed, Feb 13, 2013 at 7:50 AM, Mike mthero...@yahoo.com wrote:

Hello,

I'm investigating the transition of some of our column families from Size 
Tiered - Leveled Compaction.  I believe we have some high-read-load column 
families that would benefit tremendously.

I've stood up a test DB Node to investigate the transition.  I successfully 
alter the column family, and I immediately noticed a large number (1000+) 
pending compaction tasks become available, but no compaction get executed.

I tried running nodetool sstableupgrade on the column family, and the 
compaction tasks don't move.

I also notice no changes to the size and distribution of the existing SSTables.

I then run a major compaction on the column family.  All pending compaction 
tasks get run, and the SSTables have a distribution that I would expect from 
LeveledCompaction (lots and lots of 10MB files).

Couple of questions:

1) Is a major compaction required to transition from size-tiered to leveled 
compaction?
2) Are major compactions as much of a concern for LeveledCompaction as they 
are for Size Tiered?

All the documentation I found concerning transitioning from Size Tiered to 
Level compaction discuss the alter table cql command, but I haven't found too 
much on what else needs to be done after the schema change.

I did these tests with Cassandra 1.1.9.

Thanks,
-Mike

Re: Why do Datastax docs recommend Java 6?

2013-02-06 Thread Wei Zhu

Does anyone have first-hand experience with the Zing JVM, which is claimed to be pauseless? How do they charge, per CPU?

Thanks
-Wei

From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org
Sent: Wednesday, February 6, 2013 7:07 AM
Subject: Re: Why do Datastax docs recommend Java 6?

Oracle already did this once. It was called jrockit :)
http://www.oracle.com/technetwork/middleware/jrockit/overview/index.html

Typically Oracle acquires the technology and then the bits are merged with the standard JVM.

On Wed, Feb 6, 2013 at 2:13 AM, Viktor Jevdokimov viktor.jevdoki...@adform.com wrote:

I would prefer Oracle to own Azul’s Zing JVM over any other (GC) to provide it for free for anyone :)

Best regards / Pagarbiai
Viktor Jevdokimov
Senior Developer

From: jef...@gmail.com [mailto:jef...@gmail.com]
Sent: Wednesday, February 06, 2013 02:23
To: user@cassandra.apache.org
Subject: Re: Why do Datastax docs recommend Java 6?

Oracle now owns the Sun HotSpot team, which is inarguably the highest powered Java VM team in the world. It's still really the epicenter of all Java VM development.

Sent from my Verizon Wireless BlackBerry

From: "Ilya Grebnov" i...@metricshub.com
Date: Tue, 5 Feb 2013 14:09:33 -0800
To: user@cassandra.apache.org
ReplyTo: user@cassandra.apache.org
Subject: RE: Why do Datastax docs recommend Java 6?

Also, what is the particular reason to use Oracle JDK over OpenJDK? Sorry, I could not find this information online.

Thanks,
Ilya

From: Michael Kjellman [mailto:mkjell...@barracuda.com]
Sent: Tuesday, February 05, 2013 7:29 AM
To: user@cassandra.apache.org
Subject: Re: Why do Datastax docs recommend Java 6?

There have been tons of threads/convos on this.

In the early days of Java 7 it was pretty unstable and there was pretty much no convincing reason to use Java 7 over Java 6.

Now that Java 7 has stabilized and Java 6 is EOL it's a reasonable decision to use Java 7 and we do it in production with no issues to speak of.

That being said there was one potential situation we've seen as a community where bootstrapping a new node was using 3x more CPU and getting significantly less throughput. However, reproducing this consistently never happened AFAIK.

I think that until more people use Java 7 in production and prove it doesn't cause any additional bugs/performance issues, Datastax won't update their docs. For now I'd say it's a safe bet to use Java 7 with Vanilla C* 1.2.1. I hope this helps!

Best,
Michael

From: Baron Schwartz ba...@xaprb.com
Reply-To: "user@cassandra.apache.org" user@cassandra.apache.org
Date: Tuesday, February 5, 2013 7:21 AM
To: "user@cassandra.apache.org" user@cassandra.apache.org
Subject: Why do Datastax docs recommend Java 6?

The Datastax docs repeatedly say (e.g. http://www.datastax.com/docs/1.2/install/install_jre) that Java 7 is not recommended, but they don't say why. It would be helpful to know this. Does anyone know?

The same documentation is referenced from the Cassandra wiki, for example, http://wiki.apache.org/cassandra/GettingStarted

- Baron

Re: Estimating write throughput with LeveledCompactionStrategy

2013-02-06 Thread Wei Zhu

I have been struggling with LCS myself. I observed that a higher-level 
compaction (from level 4 to 5) involves many more SSTables than a lower-level 
one. One compaction could take an hour or more. By the way, did you set your 
SSTable size to 100M?

Thanks.
-Wei 



 From: Ивaн Cобoлeв sobol...@gmail.com
To: user@cassandra.apache.org 
Sent: Wednesday, February 6, 2013 2:42 AM
Subject: Estimating write throughput with LeveledCompactionStrategy
 
Dear Community,

Could anyone please give me a hand with understanding what I am
missing while trying to model how LeveledCompactionStrategy works:
https://docs.google.com/spreadsheet/ccc?key=0AvNacZ0w52BydDQ3N2ZPSks2OHR1dlFmMVV4d1E2eEE#gid=0

Logs mostly contain something like this:
INFO [CompactionExecutor:2235] 2013-02-06 02:32:29,758
CompactionTask.java (line 221) Compacted to
[chunks-hf-285962-Data.db,chunks-hf-285963-Data.db,chunks-hf-285964-Data.db,chunks-hf-285965-Data.db,chunks-hf-285966-Data.db,chunks-hf-285967-Data.db,chunks-hf-285968-Data.db,chunks-hf-285969-Data.db,chunks-hf-285970-Data.db,chunks-hf-285971-Data.db,chunks-hf-285972-Data.db,chunks-hf-285973-Data.db,chunks-hf-285974-Data.db,chunks-hf-285975-Data.db,chunks-hf-285976-Data.db,chunks-hf-285977-Data.db,chunks-hf-285978-Data.db,chunks-hf-285979-Data.db,chunks-hf-285980-Data.db,].
2,255,863,073 to 1,908,460,931 (~84% of original) bytes for 36,868
keys at 14.965795MB/s.  Time: 121,614ms.

Thus the spreadsheet is parameterized with a throughput of 15 MB/s and
a survivor ratio of 0.9.
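
As a sanity check on the two spreadsheet parameters, the figures in the log line above can be recomputed directly. This is a small Python sketch, not part of the original thread; the function name is made up for illustration, and it assumes the logged rate is output bytes divided by 1024*1024 per second, which does reproduce the logged 14.965795MB/s.

```python
# Figures copied from the CompactionTask log line quoted above.
BYTES_IN = 2_255_863_073
BYTES_OUT = 1_908_460_931
ELAPSED_MS = 121_614

def compaction_stats(bytes_in, bytes_out, elapsed_ms):
    """Return (survivor_ratio, output rate in MiB per second)."""
    ratio = bytes_out / bytes_in
    mib_per_s = (bytes_out / (1024 * 1024)) / (elapsed_ms / 1000)
    return ratio, mib_per_s

ratio, rate = compaction_stats(BYTES_IN, BYTES_OUT, ELAPSED_MS)
print(f"survivor ratio ~{ratio:.3f}, output rate ~{rate:.2f} MiB/s")
```

For this one compaction the ratio comes out a little under 0.85, so a model that uses 0.9 everywhere may already drift slightly from what the logs show.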

1) Projected result actually differs from what I observe - what am I missing?
2) Are there any metrics on write throughput with LCS per node anyone
could possibly share?

Thank you very much in advance,
Ivan

Re: Cassandra pending compaction tasks keeps increasing

2013-02-01 Thread Wei Zhu

That must be it.
Yes, it happens to be a seed. I should have tried rebuild. Instead I did a 
repair, and now I am sitting here waiting for the compaction to finish...

Thanks.
-Wei



 From: Derek Williams de...@fyrie.net
To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com 
Sent: Friday, February 1, 2013 1:56 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing
 

Did the node list itself as a seed node in cassandra.yaml? Unless something has 
changed, a node that considers itself a seed will not auto bootstrap. Although 
I haven't tried it, I think running 'nodetool rebuild' will cause it to stream 
in the data it needs without doing a repair.
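
A quick way to check the condition described above is to compare the node's own address against the seeds entry from cassandra.yaml. The following Python sketch is only an illustration: the function name and the sample addresses are hypothetical, and it assumes the seeds value has already been read from the file as the usual comma-separated string.

```python
def lists_itself_as_seed(seeds_value, own_address):
    """Return True if own_address appears in a comma-separated seeds string."""
    seeds = [s.strip() for s in seeds_value.split(",") if s.strip()]
    return own_address in seeds

# Hypothetical values; in practice they come from cassandra.yaml on the node.
seeds_value = "10.0.0.1, 10.0.0.3"
print(lists_itself_as_seed(seeds_value, "10.0.0.3"))  # node is a seed
print(lists_itself_as_seed(seeds_value, "10.0.0.4"))  # node is not a seed
```

If the check is true for the node being replaced, the seed behaviour described above would explain a "bootstrap completed" message with almost no data streamed.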



On Wed, Jan 30, 2013 at 9:30 PM, Wei Zhu wz1...@yahoo.com wrote:

Some updates:
Since we still have not fully turned on the system, we did something crazy 
today. We tried to treat the node as a dead one (my boss wants us to practice 
replacing a dead node before going to full production) and bootstrap it. Here 
is what we did:

* drain the node
* check nodetool on the other nodes; this node is marked down (the token for 
this node is 100)
* clear the data, commit log, and saved caches
* change initial_token from 100 to 99 in the yaml file
* start the node
* check nodetool; the down node with token 100 disappeared by itself (!!) and 
a new node with token 99 showed up
* check the log; it shows a message saying bootstrap completed, but only a 
couple of MB were streamed
* nodetool movetoken 98
* check nodetool; the node with token 98 comes up
* check the log; it shows a message saying bootstrap completed, but still only 
a couple of MB were streamed

The only reason I can think of is that the new node has the same IP as the 
dead node we tried to replace. Will that cause the symptom of no data being 
streamed from other nodes? Do the other nodes still think the node has all the 
data?

We had to do nodetool repair -pr to bring in the data. After 3 hours, 150G had 
been transferred. And no surprise, pending compaction tasks are now at 30K. 
About 30K SSTables were transferred, and I guess all of them need to be 
compacted since we use LCS.

My concern is that, even if we did nothing wrong, replacing a dead node will 
cause such a huge backlog of pending compactions. It might take a week to clear 
that off. And since we have RF = 3, we still need to bring in the data for the 
other two replicas, because we used -pr for nodetool repair. So it will take 
about 3 weeks to fully replace a 200G node using LCS? We tried everything we 
could to speed up the compaction, with no luck. The only thing I can think of 
is to increase the default SSTable size, so that fewer compactions will be 
needed. Can I just change it in the yaml and restart C*, and will it correct 
itself? Any side effects? Since we are using SSD, a somewhat bigger SSTable 
won't slow down reads too much; I suppose that is the main concern with a 
bigger SSTable size?

I think 1.2 comes with parallel LCS, which should help the situation. But we 
are not going to upgrade for a little while.

Did I miss anything? Might it not be practical to use LCS for a 200G node? But 
if we use size-tiered compaction, we need at least 400G of disk... Although SSD 
is cheap now, it is still hard to convince management. Three replicas plus 
double the disk for compaction? That is 6 times the real data size!

Sorry for the long email. Any suggestion or advice?

Thanks.
-Wei 



From: aaron morton aa...@thelastpickle.com
To: Cassandra User user@cassandra.apache.org
Sent: Tuesday, January 29, 2013 12:59:42 PM

Subject: Re: Cassandra pending compaction tasks keeps increasing


* Will try it tomorrow. Do I need to restart server to change the log level?
You can set it via JMX, and supposedly log4j is configured to watch the 
config file. 


Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand


@aaronmorton
http://www.thelastpickle.com

On 29/01/2013, at 9:36 PM, Wei Zhu wz1...@yahoo.com wrote:

Thanks for the reply. Here is some information:

Do you have wide rows ? Are you seeing logging about Compacting wide rows ? 

* I don't see any log about wide rows

Are you seeing GC activity logged or seeing CPU steal on a VM ? 

* There is some GC, but CPU in general is under 20%. We have a heap size of 
8G; RAM is 72G.

Have you tried disabling multithreaded_compaction ? 

* By default, it's disabled. We enabled it, but didn't see much difference; 
it was even a little slower with it enabled. Is it bad to enable it? We have 
SSDs, and according to the comment in the yaml, it should help when using SSD.

Are you using Key Caches ? Have you tried disabling 
compaction_preheat_key_cache? 

* We have a fairly big key cache; we set it to 10% of the heap, which is 
800M. Yes, compaction_preheat_key_cache is disabled. 

Can you enable DEBUG level logging and make the logs available ? 

* Will try it tomorrow. Do I need to restart server to change

General question regarding bootstrap and nodetool repair

2013-01-31 Thread Wei Zhu

Hi,
After messing around with my Cassandra cluster recently, I think I need some 
basic understanding of how things work behind the scenes regarding data 
streaming.
Let's say we have a three-node cluster with RF = 3. If node 3 dies for some 
reason and I want to replace it with a new node with the same (maybe minus one) 
range, how is the data streamed during the bootstrap?
From what I observed, node 3 has replicas of its primary range on nodes 4 and 
5, so it streams the data from them and starts to compact it. Node 3 also holds 
replicas for the primary range of node 2, so it streams data from node 2 and 
node 4. Similarly, it holds replicas for node 1, so data is streamed from node 
1 and node 2. So during bootstrapping, it basically gets the data from all the 
replicas (2 copies each), which means it will require double the disk space to 
hold the data? Over time, those SSTables will be compacted and the redundant 
data removed? Is that true?
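
To make the replica layout in the question concrete, here is a small Python sketch. It assumes the five-node ring mentioned elsewhere in this thread (the message refers to nodes 4 and 5) and a simple placement in which each primary range is replicated on its owner plus the next RF-1 nodes clockwise. The function names are hypothetical and this is not Cassandra's code.

```python
def replica_nodes(primary, nodes, rf):
    """Nodes storing the primary range of `primary`: itself plus the next rf-1 clockwise."""
    i = nodes.index(primary)
    return [nodes[(i + k) % len(nodes)] for k in range(rf)]

def ranges_held_by(node, nodes, rf):
    """Primary ranges (named by owner) for which `node` stores a replica."""
    return [p for p in nodes if node in replica_nodes(p, nodes, rf)]

nodes = [1, 2, 3, 4, 5]
print(replica_nodes(3, nodes, 3))   # who else has node 3's primary range
print(ranges_held_by(3, nodes, 3))  # whose ranges node 3 must get back
for owner in ranges_held_by(3, nodes, 3):
    others = [n for n in replica_nodes(owner, nodes, 3) if n != 3]
    print(f"range of node {owner}: possible sources {others}")
```

Under these assumptions node 3 needs three ranges back, each with two possible sources, which matches the node pairs listed in the question.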

If we issue nodetool repair -pr on node 3, apart from data streaming from nodes 
4 and 5 to node 3, we also see data streaming between nodes 4 and 5, since they 
hold the replicas. But I don't see any log about a Merkle tree calculation on 
nodes 4 and 5. Just wondering how they know what data to stream in order to 
repair nodes 4 and 5?

Thanks.
-Wei

Re: General question regarding bootstrap and nodetool repair

2013-01-31 Thread Wei Zhu

I decided to dig into the source code. It looks like, in the case of nodetool 
repair, if the current node sees a difference between the remote nodes based 
on the Merkle tree calculation, it will start a streamrepair session to ask the 
remote nodes to stream data between each other. 
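
For readers unfamiliar with the idea, the comparison step can be sketched with a toy Merkle tree in Python. This is a simplified illustration, not Cassandra's implementation: the helper names are made up, each leaf stands for one token range, and the number of leaves is assumed to be a power of two.

```python
import hashlib

def h(data):
    return hashlib.sha256(data).digest()

def leaf_hashes(rows_by_range):
    """One hash per token range, computed over that range's sorted rows."""
    return [h(repr(sorted(rows.items())).encode()) for rows in rows_by_range]

def build_tree(leaves):
    """Return the tree as a list of levels: levels[0] = leaves, levels[-1] = [root]."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def differing_leaves(a, b):
    """Descend from the root, visiting only subtrees whose hashes differ."""
    stack, out = [(len(a) - 1, 0)], []
    while stack:
        level, idx = stack.pop()
        if a[level][idx] == b[level][idx]:
            continue
        if level == 0:
            out.append(idx)
        else:
            stack += [(level - 1, 2 * idx), (level - 1, 2 * idx + 1)]
    return sorted(out)

node_a = [{"k1": "v1"}, {"k2": "v2"}, {"k3": "v3"}, {"k4": "v4"}]
node_b = [{"k1": "v1"}, {"k2": "v2"}, {"k3": "STALE"}, {"k4": "v4"}]
diff = differing_leaves(build_tree(leaf_hashes(node_a)), build_tree(leaf_hashes(node_b)))
print(diff)  # indexes of the ranges that would need to be streamed
```

Only the ranges under differing leaves need streaming, but a whole range is exchanged even when a single value in it differs.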

But I am still not sure about my first question regarding the bootstrap. 
Anyone?

Thanks.
-Wei


 From: Wei Zhu wz1...@yahoo.com
To: Cassandr usergroup user@cassandra.apache.org 
Sent: Thursday, January 31, 2013 10:50 AM
Subject: General question regarding bootstrap and nodetool repair
 

Hi,
After messing around with my Cassandra cluster recently, I think I need some 
basic understanding of how things work behind the scenes regarding data 
streaming.
Let's say we have a three-node cluster with RF = 3. If node 3 dies for some 
reason and I want to replace it with a new node with the same (maybe minus one) 
range, how is the data streamed during the bootstrap?
From what I observed, node 3 has replicas of its primary range on nodes 4 and 
5, so it streams the data from them and starts to compact it. Node 3 also holds 
replicas for the primary range of node 2, so it streams data from node 2 and 
node 4. Similarly, it holds replicas for node 1, so data is streamed from node 
1 and node 2. So during bootstrapping, it basically gets the data from all the 
replicas (2 copies each), which means it will require double the disk space to 
hold the data? Over time, those SSTables will be compacted and the redundant 
data removed? Is that true?

If we issue nodetool repair -pr on node 3, apart from data streaming from nodes 
4 and 5 to node 3, we also see data streaming between nodes 4 and 5, since they 
hold the replicas. But I don't see any log about a Merkle tree calculation on 
nodes 4 and 5. Just wondering how they know what data to stream in order to 
repair nodes 4 and 5?

Thanks.
-Wei

Re: General question regarding bootstrap and nodetool repair

2013-01-31 Thread Wei Zhu

Thanks Rob. I think you are right on it.

Here is what I found:

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L140


It sorts the endpoints by proximity, and in 

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L171


it fetches the data from only one source.

That answers my question. So we will have to run repair after the bootstrap to 
ensure consistency. 
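
The behaviour described above (sort candidate endpoints by proximity, then fetch each range from just one of them) can be sketched as follows. This Python snippet only illustrates the idea and is not the RangeStreamer code; the function name, node names, and distance values are all hypothetical.

```python
def pick_stream_sources(range_replicas, distance, alive):
    """For each range, pick the closest live replica as the single stream source."""
    sources = {}
    for token_range, replicas in range_replicas.items():
        live = [r for r in replicas if r in alive]
        if not live:
            raise RuntimeError(f"no live replica for range {token_range}")
        sources[token_range] = min(live, key=lambda r: distance[r])
    return sources

# Hypothetical view from a replacement node 3 in a five-node ring with RF = 3.
range_replicas = {
    "range-of-node-1": ["node1", "node2"],
    "range-of-node-2": ["node2", "node4"],
    "range-of-node-3": ["node4", "node5"],
}
distance = {"node1": 3, "node2": 1, "node4": 2, "node5": 4}  # lower means closer
alive = {"node1", "node2", "node4", "node5"}

sources = pick_stream_sources(range_replicas, distance, alive)
print(sources)
```

Because each range arrives from one replica only, anything that replica is missing is also missing on the new node, which is why a repair is still needed afterwards.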

Thanks.
-Wei




 From: Rob Coli rc...@palominodb.com
To: user@cassandra.apache.org 
Sent: Thursday, January 31, 2013 1:50 PM
Subject: Re: General question regarding bootstrap and nodetool repair
 
On Thu, Jan 31, 2013 at 12:19 PM, Wei Zhu wz1...@yahoo.com wrote:
 But I am still not sure about my first question regarding the
 bootstrap. Anyone?

As I understand it, bootstrap occurs from a single replica. Which
replica is chosen is based on some internal estimation of which is
closest/least loaded/etc. But only from a single replica, so in RF=3,
in order to be consistent with both you still have to run a repair.

=Rob

-- 
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Re: General question regarding bootstrap and nodetool repair

2013-01-31 Thread Wei Zhu

One more question, though. 
I tried to replace a node with a new node with the same IP. Here is what we 
did: 

* drain the node 
* check nodetool on the other nodes; this node is marked down (the token for 
this node is 100) 
* clear the data, commit log, and saved caches on the down node 
* change initial_token from 100 to 99 in the yaml file 
* start the node 
* check nodetool; the down node with token 100 disappeared by itself (!!) and 
a new node with token 99 showed up 
* check the log; it shows a message saying bootstrap completed, but only a 
couple of MB were streamed 
* nodetool movetoken 98 
* check nodetool; the node with token 98 comes up 
* check the log; it shows a message saying bootstrap completed, but still only 
a couple of MB were streamed 

The only reason I can think of is that the new node has the same IP as the 
dead node we tried to replace. After reading the bootstrap code, that shouldn't 
be the case. Is it a bug? Or has anyone tried to replace a dead node with the 
same IP? 

Thanks. 
-Wei 


- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 31, 2013 3:14:59 PM 
Subject: Re: General question regarding bootstrap and nodetool repair 



Thanks Rob. I think you are right on it. 

Here is what I found: 

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L140

It sorts the endpoints by proximity, and in 

https://github.com/apache/cassandra/blob/cassandra-1.1.0/src/java/org/apache/cassandra/dht/RangeStreamer.java#L171

it fetches the data from only one source. 

That answers my question. So we will have to run repair after the bootstrap to 
ensure consistency. 


Thanks. 
-Wei 







From: Rob Coli rc...@palominodb.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 31, 2013 1:50 PM 
Subject: Re: General question regarding bootstrap and nodetool repair 

On Thu, Jan 31, 2013 at 12:19 PM, Wei Zhu  wz1...@yahoo.com  wrote: 


But I am still not sure about my first question regarding the 
bootstrap. Anyone? 

As I understand it, bootstrap occurs from a single replica. Which 
replica is chosen is based on some internal estimation of which is 
closest/least loaded/etc. But only from a single replica, so in RF=3, 
in order to be consistent with both you still have to run a repair. 

=Rob 

-- 
=Robert Coli 
AIMGTALK - rc...@palominodb.com 
YAHOO - rcoli.palominob 
SKYPE - rcoli_palominodb

Re: Cassandra pending compaction tasks keeps increasing

2013-01-30 Thread Wei Zhu

Some updates: 
Since we still have not fully turned on the system, we did something crazy 
today. We tried to treat the node as a dead one (my boss wants us to practice 
replacing a dead node before going to full production) and bootstrap it. Here 
is what we did: 

* drain the node 
* check nodetool on the other nodes; this node is marked down (the token for 
this node is 100) 
* clear the data, commit log, and saved caches 
* change initial_token from 100 to 99 in the yaml file 
* start the node 
* check nodetool; the down node with token 100 disappeared by itself (!!) and 
a new node with token 99 showed up 
* check the log; it shows a message saying bootstrap completed, but only a 
couple of MB were streamed 
* nodetool movetoken 98 
* check nodetool; the node with token 98 comes up 
* check the log; it shows a message saying bootstrap completed, but still only 
a couple of MB were streamed 

The only reason I can think of is that the new node has the same IP as the 
dead node we tried to replace. Will that cause the symptom of no data being 
streamed from other nodes? Do the other nodes still think the node has all the 
data? 

We had to do nodetool repair -pr to bring in the data. After 3 hours, 150G had 
been transferred. And no surprise, pending compaction tasks are now at 30K. 
About 30K SSTables were transferred, and I guess all of them need to be 
compacted since we use LCS. 
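
The "about 30K SSTables" figure is consistent with simple arithmetic on the numbers in this thread. The Python sketch below assumes every streamed SSTable is roughly the configured size (5M, the default mentioned elsewhere in the thread); the function name is hypothetical and the 100M value is only there for comparison.

```python
def sstable_count(data_gb, sstable_size_mb=5):
    """Approximate number of SSTables needed to hold data_gb of data."""
    return (data_gb * 1024) // sstable_size_mb

print(sstable_count(150))       # 150G at 5M per SSTable
print(sstable_count(150, 100))  # the same data at 100M per SSTable
```

With 5M SSTables, 150G works out to roughly 30,000 files, so a larger SSTable size reduces the number of files that leveled compaction has to place, at the cost of larger individual compactions.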

My concern is that, even if we did nothing wrong, replacing a dead node will 
cause such a huge backlog of pending compactions. It might take a week to clear 
that off. And since we have RF = 3, we still need to bring in the data for the 
other two replicas, because we used -pr for nodetool repair. So it will take 
about 3 weeks to fully replace a 200G node using LCS? We tried everything we 
could to speed up the compaction, with no luck. The only thing I can think of 
is to increase the default SSTable size, so that fewer compactions will be 
needed. Can I just change it in the yaml and restart C*, and will it correct 
itself? Any side effects? Since we are using SSD, a somewhat bigger SSTable 
won't slow down reads too much; I suppose that is the main concern with a 
bigger SSTable size? 

I think 1.2 comes with parallel LCS, which should help the situation. But we 
are not going to upgrade for a little while. 

Did I miss anything? Might it not be practical to use LCS for a 200G node? But 
if we use size-tiered compaction, we need at least 400G of disk... Although SSD 
is cheap now, it is still hard to convince management. Three replicas plus 
double the disk for compaction? That is 6 times the real data size! 
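
The "6 times" figure is just the replication factor multiplied by the compaction headroom. A minimal Python sketch of that arithmetic, with a hypothetical function name and the 2x headroom taken from the message above as an assumption:

```python
def raw_disk_gb(unique_data_gb, rf, compaction_headroom):
    """Cluster-wide raw disk needed for unique_data_gb of unique data."""
    return unique_data_gb * rf * compaction_headroom

# RF = 3 and 2x headroom for size-tiered compaction, as stated in the message.
print(raw_disk_gb(1, rf=3, compaction_headroom=2))    # multiplier per unit of data
print(raw_disk_gb(200, rf=1, compaction_headroom=2))  # one node holding 200G
```

The first call gives the 6x multiplier; the second gives the 400G per-node figure for a node that stores 200G.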

Sorry for the long email. Any suggestion or advice? 

Thanks. 
-Wei 

- Original Message -

From: aaron morton aa...@thelastpickle.com 
To: Cassandra User user@cassandra.apache.org 
Sent: Tuesday, January 29, 2013 12:59:42 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 



* Will try it tomorrow. Do I need to restart server to change the log level? 


You can set it via JMX, and supposedly log4j is configured to watch the config 
file. 


Cheers 








- 
Aaron Morton 
Freelance Cassandra Developer 
New Zealand 


@aaronmorton 
http://www.thelastpickle.com 


On 29/01/2013, at 9:36 PM, Wei Zhu  wz1...@yahoo.com  wrote: 

Thanks for the reply. Here is some information: 

Do you have wide rows ? Are you seeing logging about Compacting wide rows ? 

* I don't see any log about wide rows 

Are you seeing GC activity logged or seeing CPU steal on a VM ? 

* There is some GC, but CPU in general is under 20%. We have a heap size of 
8G; RAM is 72G. 

Have you tried disabling multithreaded_compaction ? 

* By default, it's disabled. We enabled it, but didn't see much difference; 
it was even a little slower with it enabled. Is it bad to enable it? We have 
SSDs, and according to the comment in the yaml, it should help when using SSD. 

Are you using Key Caches ? Have you tried disabling 
compaction_preheat_key_cache? 

* We have a fairly big key cache; we set it to 10% of the heap, which is 
800M. Yes, compaction_preheat_key_cache is disabled. 

Can you enable DEBUG level logging and make the logs available ? 

* Will try it tomorrow. Do I need to restart server to change the log level? 


-Wei 

- Original Message -

From: aaron morton  aa...@thelastpickle.com  
To: user@cassandra.apache.org 
Sent: Monday, January 28, 2013 11:31:42 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 







* Why nodetool repair increases the data size that much? It's not likely that 
much data needs to be repaired. Will that happen for all the subsequent repair? 
Repair only detects differences in entire rows. If you have very wide rows then 
small differences in rows can result in a large amount of streaming. 
Streaming creates new SSTables on the receiving side, which then need to be 
compacted. So repair often results in compaction doing its thing for a while. 








* How to make LCS run faster? After almost a day

Re: Cassandra pending compaction tasks keeps increasing

2013-01-29 Thread Wei Zhu

Thanks for the reply. Here is some information:

Do you have wide rows ? Are you seeing logging about Compacting wide rows ? 

* I don't see any log about wide rows

Are you seeing GC activity logged or seeing CPU steal on a VM ? 

* There is some GC, but CPU in general is under 20%. We have a heap size of 
8G; RAM is 72G.

Have you tried disabling multithreaded_compaction ? 

* By default, it's disabled. We enabled it, but didn't see much difference; 
it was even a little slower with it enabled. Is it bad to enable it? We have 
SSDs, and according to the comment in the yaml, it should help when using SSD.

Are you using Key Caches ? Have you tried disabling 
compaction_preheat_key_cache? 

* We have a fairly big key cache; we set it to 10% of the heap, which is 
800M. Yes, compaction_preheat_key_cache is disabled. 

Can you enable DEBUG level logging and make the logs available ? 

* Will try it tomorrow. Do I need to restart server to change the log level?


-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Monday, January 28, 2013 11:31:42 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing







* Why nodetool repair increases the data size that much? It's not likely that 
much data needs to be repaired. Will that happen for all the subsequent repair? 
Repair only detects differences in entire rows. If you have very wide rows then 
small differences in rows can result in a large amount of streaming. 
Streaming creates new SSTables on the receiving side, which then need to be 
compacted. So repair often results in compaction doing its thing for a while. 








* How to make LCS run faster? After almost a day, the LCS tasks only dropped by 
1000. I am afraid it will never catch up. We set 


This is going to be tricky to diagnose, sorry for asking silly questions... 


Do you have wide rows ? Are you seeing logging about Compacting wide rows ? 
Are you seeing GC activity logged or seeing CPU steal on a VM ? 
Have you tried disabling multithreaded_compaction ? 
Are you using Key Caches ? Have you tried disabling 
compaction_preheat_key_cache? 
Can you enable DEBUG level logging and make the logs available ? 


Cheers 








- 
Aaron Morton 
Freelance Cassandra Developer 
New Zealand 


@aaronmorton 
http://www.thelastpickle.com 


On 29/01/2013, at 8:59 AM, Derek Williams  de...@fyrie.net  wrote: 



I could be wrong about this, but when repair is run, it isn't just values that 
are streamed between nodes, it's entire sstables. This causes a lot of 
duplicate data to be written which was already correct on the node, which needs 
to be compacted away. 


As for speeding it up, no idea. 



On Mon, Jan 28, 2013 at 12:16 PM, Wei Zhu  wz1...@yahoo.com  wrote: 


Any thoughts? 


Thanks. 
-Wei 

- Original Message - 

From: Wei Zhu  wz1...@yahoo.com  
To: user@cassandra.apache.org 

Sent: Friday, January 25, 2013 10:09:37 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 



To recap the problem: 
1.1.6 on SSD, 5 nodes, RF = 3, one CF only. 
After the data load, all 5 nodes initially had very even data sizes (135G 
each). I ran nodetool repair -pr on node 1, which has replicas on node 2 and 
node 3 since we set RF = 3. 
It appears that a huge amount of data got transferred. Node 1 has 220G; nodes 
2 and 3 have around 170G. Pending LCS tasks on node 1 are at 15K, and nodes 2 
and 3 have around 7K each. 
Questions: 

* Why does nodetool repair increase the data size that much? It's not likely 
that much data needs to be repaired. Will that happen for all subsequent 
repairs? 
* How can we make LCS run faster? After almost a day, the LCS tasks only 
dropped by 1000. I am afraid it will never catch up. We set 

* compaction_throughput_mb_per_sec = 500 
* multithreaded_compaction: true 

Both disk and CPU utilization are less than 10%. I understand LCS is single 
threaded; any chance to speed it up? 

* We use the default SSTable size of 5M. Will increasing the SSTable size 
help? What will happen if I change the setting after the data is loaded? 


Any suggestion is very much appreciated. 

-Wei 


- Original Message - 

From: Wei Zhu  wz1...@yahoo.com  
To: user@cassandra.apache.org 

Sent: Thursday, January 24, 2013 11:46:04 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

I believe I am running into this one: 

https://issues.apache.org/jira/browse/CASSANDRA-4765 

By the way, I am using 1.1.6 (I thought I was using 1.1.7) and this one is 
fixed in 1.1.7. 



- Original Message - 

From: Wei Zhu  wz1...@yahoo.com  
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:18:59 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Thanks Derek, 
in the cassandra-env.sh, it says 

# reduce the per-thread stack size to minimize the impact of Thrift 
# thread-per-client. (Best practice is for client connections to 
# be pooled anyway.) Only do so on Linux where it is known

Re: Cassandra pending compaction tasks keeps increasing

2013-01-28 Thread Wei Zhu

Any thoughts?

Thanks.
-Wei

- Original Message -
From: Wei Zhu wz1...@yahoo.com
To: user@cassandra.apache.org
Sent: Friday, January 25, 2013 10:09:37 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

To recap the problem: 
1.1.6 on SSD, 5 nodes, RF = 3, one CF only. 
After the data load, all 5 nodes initially had very even data sizes (135G 
each). I ran nodetool repair -pr on node 1, which has replicas on node 2 and 
node 3 since we set RF = 3. 
It appears that a huge amount of data got transferred. Node 1 has 220G; nodes 
2 and 3 have around 170G. Pending LCS tasks on node 1 are at 15K, and nodes 2 
and 3 have around 7K each. 
Questions: 

* Why does nodetool repair increase the data size that much? It's not likely 
that much data needs to be repaired. Will that happen for all subsequent 
repairs? 
* How can we make LCS run faster? After almost a day, the LCS tasks only 
dropped by 1000. I am afraid it will never catch up. We set 

* compaction_throughput_mb_per_sec = 500 
* multithreaded_compaction: true 

Both disk and CPU utilization are less than 10%. I understand LCS is single 
threaded; any chance to speed it up? 

* We use the default SSTable size of 5M. Will increasing the SSTable size 
help? What will happen if I change the setting after the data is loaded? 

Any suggestion is very much appreciated. 

-Wei 

- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:46:04 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

I believe I am running into this one: 

https://issues.apache.org/jira/browse/CASSANDRA-4765 

By the way, I am using 1.1.6 (I thought I was using 1.1.7) and this one is 
fixed in 1.1.7. 

- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:18:59 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Thanks Derek, 
in the cassandra-env.sh, it says 

# reduce the per-thread stack size to minimize the impact of Thrift 
# thread-per-client. (Best practice is for client connections to 
# be pooled anyway.) Only do so on Linux where it is known to be 
# supported. 
# u34 and greater need 180k 
JVM_OPTS="$JVM_OPTS -Xss180k" 

What value should I use? Java defaults at 400K? Maybe try that first. 

Thanks. 
-Wei 

- Original Message - 
From: Derek Williams de...@fyrie.net 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Sent: Thursday, January 24, 2013 11:06:00 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Increasing the stack size in cassandra-env.sh should help you get past the 
stack overflow. Doesn't help with your original problem though. 

On Fri, Jan 25, 2013 at 12:00 AM, Wei Zhu  wz1...@yahoo.com  wrote: 

Well, even after a restart, it throws the same exception. I am basically 
stuck. Any suggestion to clear the pending compaction tasks? Below is the end 
of the stack trace: 

at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$3.iterator(Sets.java:667) 
at com.google.common.collect.Sets$3.size(Sets.java:670) 
at com.google.common.collect.Iterables.size(Iterables.java:80) 
at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:557) 
at org.apache.cassandra.db.compaction.CompactionController.<init>(CompactionController.java:69) 
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:105) 
at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50) 
at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154) 
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) 
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) 
at java.util.concurrent.FutureTask.run(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
at java.lang.Thread.run(Unknown Source) 

Any suggestion is very much appreciated 

-Wei 

- Original Message - 
From: Wei Zhu  wz1...@yahoo.com  
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 10:55:07 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Do you mean 90% of the reads should come from 1 SSTable? 

By the way, after I finished the data migrating, I ran nodetool repair -pr on 
one of the nodes. Before nodetool repair, all the nodes have the same disk 
space usage. After I ran the nodetool repair, the disk space for that node 
jumped from 135G to 220G, also there are more than 15000 pending compaction 
tasks. After

Re: Cassandra pending compaction tasks keeps increasing

2013-01-28 Thread Wei Zhu

Two fundamental questions: 

* Why did nodetool repair bring in so much data? A lot of SSTables were 
created, and disk usage almost doubled. 
* Why do leveled compactions run so slowly? We turned off throttling 
completely and don't see much utilization of the SSD or CPU. One example: 
0.7MB/s on SSD? That is insane. Anything I can do to speed it up? 


* 1,837,023,925 to 1,836,694,446 (~99% of original) bytes for 1,686,604 
keys at 0.717223MB/s. Time: 2,442,208ms. 
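
To put that log line in perspective, the summary can be parsed and the elapsed time converted to minutes. The Python sketch below is not from the original thread; the regular expression is written against the two summary lines quoted in this thread and may not match other Cassandra versions, and the function name is hypothetical.

```python
import re

SUMMARY = re.compile(
    r"([\d,]+) to ([\d,]+) \(~(\d+)% of original\) bytes for ([\d,]+) keys "
    r"at ([\d.]+)MB/s\. Time: ([\d,]+)ms"
)

def parse_summary(text):
    """Parse a compaction summary; return None if the text does not match."""
    match = SUMMARY.search(" ".join(text.split()))  # undo line wraps and double spaces
    if match is None:
        return None
    to_int = lambda s: int(s.replace(",", ""))
    bytes_out = to_int(match.group(2))
    elapsed_s = to_int(match.group(6)) / 1000
    return {
        "bytes_in": to_int(match.group(1)),
        "bytes_out": bytes_out,
        "keys": to_int(match.group(4)),
        "logged_mb_s": float(match.group(5)),
        "minutes": elapsed_s / 60,
        "recomputed_mb_s": bytes_out / (1024 * 1024) / elapsed_s,
    }

line = """1,837,023,925 to 1,836,694,446 (~99% of original) bytes for 1,686,604
keys at 0.717223MB/s. Time: 2,442,208ms."""
stats = parse_summary(line)
print(f"{stats['minutes']:.1f} minutes at {stats['recomputed_mb_s']:.3f} MB/s")
```

That single compaction took about 40 minutes to rewrite under 2 GB, with almost nothing discarded (~99% of the input survived).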

Thanks. 
-Wei 
- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Monday, January 28, 2013 11:16:47 AM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Any thoughts? 

Thanks. 
-Wei 

- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Friday, January 25, 2013 10:09:37 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 


To recap the problem: 
1.1.6 on SSD, 5 nodes, RF = 3, one CF only. 
After the data load, all 5 nodes initially had very even data sizes (135G 
each). I ran nodetool repair -pr on node 1, which has replicas on node 2 and 
node 3 since we set RF = 3. 
It appears that a huge amount of data got transferred. Node 1 has 220G; nodes 
2 and 3 have around 170G. Pending LCS tasks on node 1 are at 15K, and nodes 2 
and 3 have around 7K each. 
Questions: 

* Why does nodetool repair increase the data size that much? It's not likely 
that much data needs to be repaired. Will that happen for all subsequent 
repairs? 
* How can we make LCS run faster? After almost a day, the LCS tasks only 
dropped by 1000. I am afraid it will never catch up. We set 

* compaction_throughput_mb_per_sec = 500 
* multithreaded_compaction: true 

Both disk and CPU utilization are less than 10%. I understand LCS is single 
threaded; any chance to speed it up? 

* We use the default SSTable size of 5M. Will increasing the SSTable size 
help? What will happen if I change the setting after the data is loaded? 

Any suggestion is very much appreciated. 

-Wei 

- Original Message - 

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:46:04 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

I believe I am running into this one: 

https://issues.apache.org/jira/browse/CASSANDRA-4765 

By the way, I am using 1.1.6 (I thought I was using 1.1.7) and this one is 
fixed in 1.1.7. 

- Original Message - 

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:18:59 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Thanks Derek, 
in the cassandra-env.sh, it says 

# reduce the per-thread stack size to minimize the impact of Thrift 
# thread-per-client. (Best practice is for client connections to 
# be pooled anyway.) Only do so on Linux where it is known to be 
# supported. 
# u34 and greater need 180k 
JVM_OPTS="$JVM_OPTS -Xss180k" 

What value should I use? Java defaults at 400K? Maybe try that first. 

Thanks. 
-Wei 

- Original Message - 
From: Derek Williams de...@fyrie.net 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Sent: Thursday, January 24, 2013 11:06:00 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 


Increasing the stack size in cassandra-env.sh should help you get past the 
stack overflow. Doesn't help with your original problem though. 



On Fri, Jan 25, 2013 at 12:00 AM, Wei Zhu  wz1...@yahoo.com  wrote: 


Well, even after restart, it throws the same exception. I am basically 
stuck. Any suggestion to clear the pending compaction tasks? Below is the end 
of stack trace: 

at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$3.iterator(Sets.java:667) 
at com.google.common.collect.Sets$3.size(Sets.java:670) 
at com.google.common.collect.Iterables.size(Iterables.java:80) 
at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:557) 
at org.apache.cassandra.db.compaction.CompactionController.<init>(CompactionController.java:69) 
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:105) 
at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50) 
at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154) 
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) 
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) 
at java.util.concurrent.FutureTask.run(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source

Re: Cassandra pending compaction tasks keeps increasing

2013-01-25 Thread Wei Zhu

To recap the problem: 
Cassandra 1.1.6 on SSD, 5 nodes, RF = 3, one CF only. 
After the data load, all 5 nodes had very even data sizes (135G each). I then 
ran nodetool repair -pr on node 1, which has replicas on node 2 and node 3 
since we set RF = 3. 
It appears that a huge amount of data got transferred. Node 1 now has 220G, and 
nodes 2 and 3 have around 170G each. Pending LCS tasks are at 15K on node 1 
and around 7K each on nodes 2 and 3. 
Questions: 

* Why does nodetool repair increase the data size that much? It's not likely 
that that much data needs to be repaired. Will this happen on every subsequent 
repair? 
* How can I make LCS run faster? After almost a day, the pending LCS tasks have 
only dropped by 1000. I am afraid it will never catch up. We set 

  * compaction_throughput_mb_per_sec = 500 
  * multithreaded_compaction: true 

Both disk and CPU utilization are below 10%. I understand LCS is single 
threaded; is there any chance to speed it up? 

* We use the default SSTable size of 5M. Will increasing the SSTable size 
help? What will happen if I change the setting after the data is loaded? 

Any suggestion is very much appreciated. 
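
To put the "it will never catch up" worry into numbers, here is a rough 
back-of-the-envelope sketch. It assumes the backlog keeps shrinking at the 
observed rate of about 1000 tasks per day and that no new compaction tasks are 
added, which is optimistic. The function name is made up for illustration. 

```python
def days_to_drain(pending_tasks, tasks_cleared_per_day):
    """Days needed to clear a backlog at a constant net drain rate."""
    if tasks_cleared_per_day <= 0:
        raise ValueError("backlog is not shrinking")
    return pending_tasks / tasks_cleared_per_day

# Figures quoted in this message; the per-day rate is assumed to apply
# to every node, which the message does not actually state.
print(days_to_drain(15_000, 1_000))  # node 1
print(days_to_drain(7_000, 1_000))   # nodes 2 and 3
```

At that rate node 1 would need roughly two weeks, which is why the drain rate 
matters more here than the absolute pending count. 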

-Wei 

- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:46:04 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

I believe I am running into this one: 

https://issues.apache.org/jira/browse/CASSANDRA-4765 

By the way, I am using 1.1.6 (I thought I was using 1.1.7) and this one is fixed 
in 1.1.7. 

- Original Message -

From: Wei Zhu wz1...@yahoo.com 
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 11:18:59 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Thanks Derek, 
in the cassandra-env.sh, it says 

# reduce the per-thread stack size to minimize the impact of Thrift 
# thread-per-client. (Best practice is for client connections to 
# be pooled anyway.) Only do so on Linux where it is known to be 
# supported. 
# u34 and greater need 180k 
JVM_OPTS="$JVM_OPTS -Xss180k" 

What value should I use? Java defaults at 400K? Maybe try that first. 

Thanks. 
-Wei 

- Original Message - 
From: Derek Williams de...@fyrie.net 
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com 
Sent: Thursday, January 24, 2013 11:06:00 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 


Increasing the stack size in cassandra-env.sh should help you get past the 
stack overflow. Doesn't help with your original problem though. 



On Fri, Jan 25, 2013 at 12:00 AM, Wei Zhu  wz1...@yahoo.com  wrote: 


Well, even after restart, it throws the same exception. I am basically 
stuck. Any suggestion to clear the pending compaction tasks? Below is the end 
of stack trace: 

at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$3.iterator(Sets.java:667) 
at com.google.common.collect.Sets$3.size(Sets.java:670) 
at com.google.common.collect.Iterables.size(Iterables.java:80) 
at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:557) 
at org.apache.cassandra.db.compaction.CompactionController.<init>(CompactionController.java:69) 
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:105) 
at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50) 
at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154) 
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) 
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) 
at java.util.concurrent.FutureTask.run(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
at java.lang.Thread.run(Unknown Source) 

Any suggestion is very much appreciated 

-Wei 



- Original Message - 
From: Wei Zhu  wz1...@yahoo.com  
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 10:55:07 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Do you mean 90% of the reads should come from 1 SSTable? 

By the way, after I finished migrating the data, I ran nodetool repair -pr on 
one of the nodes. Before the repair, all the nodes had the same disk 
space usage. After the repair, the disk usage on that node 
jumped from 135G to 220G, and there are now more than 15000 pending compaction 
tasks. After a while, Cassandra started to throw the exception below and 
stopped compacting. I had to restart the node. By the way, we are using 1.1.7. 
Something doesn't seem right. 


INFO [CompactionExecutor:108804] 2013-01-24 22:23

Does setstreamthroughput also throttle the network traffic caused by nodetool repair?

2013-01-24 Thread Wei Zhu

In the yaml, it has the following setting

# Throttles all outbound streaming file transfers on this node to the
# given total throughput in Mbps. This is necessary because Cassandra does
# mostly sequential IO when streaming data during bootstrap or repair, which
# can lead to saturating the network connection and degrading rpc performance.
# When unset, the default is 400 Mbps or 50 MB/s.
# stream_throughput_outbound_megabits_per_sec: 400
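
The "400 Mbps or 50 MB/s" in the comment above is just a bits-to-bytes 
conversion. A minimal sketch of that arithmetic (the function name is 
illustrative, not part of Cassandra):

```python
def megabits_to_megabytes_per_sec(mbps):
    """Convert a rate in megabits per second to megabytes per second."""
    return mbps / 8  # 8 bits per byte

print(megabits_to_megabytes_per_sec(400))  # the default quoted in the yaml comment
```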

Is this the same value that I set when I call

nodetool setstreamthroughput 

Should I call it on all the nodes in the cluster? Will that throttle the 
network traffic caused by nodetool repair?

Thanks.
-Wei

Re: Cassandra pending compaction tasks keeps increasing

2013-01-24 Thread Wei Zhu

Do you mean 90% of the reads should come from 1 SSTable? 

By the way, after I finished migrating the data, I ran nodetool repair -pr on 
one of the nodes. Before the repair, all the nodes had the same disk 
space usage. After the repair, the disk usage on that node 
jumped from 135G to 220G, and there are now more than 15000 pending compaction 
tasks. After a while, Cassandra started to throw the exception below and 
stopped compacting. I had to restart the node. By the way, we are using 1.1.7. 
Something doesn't seem right.


 INFO [CompactionExecutor:108804] 2013-01-24 22:23:10,427 CompactionTask.java 
(line 109) Compacting 
[SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-753782-Data.db')]
 INFO [CompactionExecutor:108804] 2013-01-24 22:23:11,610 CompactionTask.java 
(line 221) Compacted to 
[/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754996-Data.db,].  
5,259,403 to 5,259,403 (~100% of original) bytes for 1,983 keys at 
4.268730MB/s.  Time: 1,175ms.
 INFO [CompactionExecutor:108805] 2013-01-24 22:23:11,617 CompactionTask.java 
(line 109) Compacting 
[SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754880-Data.db')]
 INFO [CompactionExecutor:108805] 2013-01-24 22:23:12,828 CompactionTask.java 
(line 221) Compacted to 
[/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754997-Data.db,].  
5,272,746 to 5,272,746 (~100% of original) bytes for 1,941 keys at 
4.152339MB/s.  Time: 1,211ms.
ERROR [CompactionExecutor:108806] 2013-01-24 22:23:13,048 
AbstractCassandraDaemon.java (line 135) Exception in thread 
Thread[CompactionExecutor:108806,1,main]
java.lang.StackOverflowError
at java.util.AbstractList$Itr.hasNext(Unknown Source)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:517)
at com.google.common.collect.Iterators$3.hasNext(Iterators.java:114)


- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Wednesday, January 23, 2013 2:40:45 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing

The histogram does not look right to me, too many SSTables for an LCS CF. 


It's a symptom, not a cause. If LCS is catching up though it should be more like 
the distribution in the linked article. 


Cheers 








- 
Aaron Morton 
Freelance Cassandra Developer 
New Zealand 


@aaronmorton 
http://www.thelastpickle.com 


On 23/01/2013, at 10:57 AM, Jim Cistaro  jcist...@netflix.com  wrote: 




What version are you using? Are you seeing any compaction related assertions in 
the logs? 


Might be https://issues.apache.org/jira/browse/CASSANDRA-4411 


We experienced this problem of the count only decreasing to a certain number 
and then stopping. If you are idle, it should go to 0. I have not seen it 
overestimate for zero, only for non-zero amounts. 


As for timeouts etc, you will need to look at things like nodetool tpstats to 
see if you have pending transactions queueing up. 


Jc 


From: Wei Zhu  wz1...@yahoo.com  
Reply-To:  user@cassandra.apache.org   user@cassandra.apache.org , Wei Zhu 
 wz1...@yahoo.com  
Date: Tuesday, January 22, 2013 12:56 PM 
To:  user@cassandra.apache.org   user@cassandra.apache.org  
Subject: Re: Cassandra pending compaction tasks keeps increasing 






Thanks Aaron and Jim for your reply. The data import is done. We have about 
135G on each node in about 28K SSTables. For normal operation, we only 
have about 90 writes per second, but when I run nodetool compactionstats, the 
pending count remains at 9 and hardly changes. I guess it's just an estimated number. 


When I ran histogram, 



Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count 
1           2644              0             0         0      18660057 
2           8204              0             0         0       9824270 
3          11198              0             0         0       6968475 
4           4269              6             0         0       5510745 
5            517             29             0         0       4595205 




You can see that close to half of the reads touch 3 SSTables. The majority of read 
latencies are under 5ms; only a dozen are over 10ms. We haven't fully turned on 
reads yet, only 60 reads per second. We have seen about 20 read timeouts during the 
past 12 hours. Not a single warning in the Cassandra log. 


Is it normal for Cassandra to timeout some requests? We set rpc timeout to be 
1s, it shouldn't time out any of them? 


Thanks. 
-Wei 





From: aaron morton  aa...@thelastpickle.com  
To: user@cassandra.apache.org 
Sent: Monday, January 21, 2013 12:21 AM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 



The main guarantee LCS gives you is that most reads will only touch 1

Re: Cassandra pending compaction tasks keeps increasing

2013-01-24 Thread Wei Zhu

Thanks Derek,
in the cassandra-env.sh, it says 

# reduce the per-thread stack size to minimize the impact of Thrift
# thread-per-client.  (Best practice is for client connections to
# be pooled anyway.) Only do so on Linux where it is known to be
# supported.
# u34 and greater need 180k
JVM_OPTS="$JVM_OPTS -Xss180k"

What value should I use? Java defaults at 400K? Maybe try that first. 

Thanks.
-Wei

- Original Message -
From: Derek Williams de...@fyrie.net
To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Sent: Thursday, January 24, 2013 11:06:00 PM
Subject: Re: Cassandra pending compaction tasks keeps increasing


Increasing the stack size in cassandra-env.sh should help you get past the 
stack overflow. Doesn't help with your original problem though. 



On Fri, Jan 25, 2013 at 12:00 AM, Wei Zhu  wz1...@yahoo.com  wrote: 


Well, even after restart, it throws the same exception. I am basically 
stuck. Any suggestion to clear the pending compaction tasks? Below is the end 
of stack trace: 

at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$1.iterator(Sets.java:578) 
at com.google.common.collect.Sets$3.iterator(Sets.java:667) 
at com.google.common.collect.Sets$3.size(Sets.java:670) 
at com.google.common.collect.Iterables.size(Iterables.java:80) 
at org.apache.cassandra.db.DataTracker.buildIntervalTree(DataTracker.java:557) 
at org.apache.cassandra.db.compaction.CompactionController.<init>(CompactionController.java:69) 
at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:105) 
at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50) 
at org.apache.cassandra.db.compaction.CompactionManager$1.runMayThrow(CompactionManager.java:154) 
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) 
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) 
at java.util.concurrent.FutureTask$Sync.innerRun(Unknown Source) 
at java.util.concurrent.FutureTask.run(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Unknown Source) 
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) 
at java.lang.Thread.run(Unknown Source) 

Any suggestion is very much appreciated 

-Wei 



- Original Message - 
From: Wei Zhu  wz1...@yahoo.com  
To: user@cassandra.apache.org 
Sent: Thursday, January 24, 2013 10:55:07 PM 
Subject: Re: Cassandra pending compaction tasks keeps increasing 

Do you mean 90% of the reads should come from 1 SSTable? 

By the way, after I finished migrating the data, I ran nodetool repair -pr on 
one of the nodes. Before the repair, all the nodes had the same disk 
space usage. After the repair, the disk usage on that node 
jumped from 135G to 220G, and there are now more than 15000 pending compaction 
tasks. After a while, Cassandra started to throw the exception below and 
stopped compacting. I had to restart the node. By the way, we are using 1.1.7. 
Something doesn't seem right. 


INFO [CompactionExecutor:108804] 2013-01-24 22:23:10,427 CompactionTask.java 
(line 109) Compacting 
[SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-753782-Data.db')]
 
INFO [CompactionExecutor:108804] 2013-01-24 22:23:11,610 CompactionTask.java 
(line 221) Compacted to 
[/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754996-Data.db,]. 5,259,403 
to 5,259,403 (~100% of original) bytes for 1,983 keys at 4.268730MB/s. Time: 
1,175ms. 
INFO [CompactionExecutor:108805] 2013-01-24 22:23:11,617 CompactionTask.java 
(line 109) Compacting 
[SSTableReader(path='/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754880-Data.db')]
 
INFO [CompactionExecutor:108805] 2013-01-24 22:23:12,828 CompactionTask.java 
(line 221) Compacted to 
[/ssd/cassandra/data/zoosk/friends/zoosk-friends-hf-754997-Data.db,]. 5,272,746 
to 5,272,746 (~100% of original) bytes for 1,941 keys at 4.152339MB/s. Time: 
1,211ms. 
ERROR [CompactionExecutor:108806] 2013-01-24 22:23:13,048

Re: Is this how to read the output of nodetool cfhistograms?

2013-01-22 Thread Wei Zhu

I agree that cfhistograms is probably the most bizarre metric output I have 
ever come across, although it's extremely useful. 

I believe the offset is the value being tracked (the x-axis of a 
traditional histogram) and the number under each column is how many times that 
value was recorded (the y-axis of a traditional histogram). Your write 
latencies are 17, 20 and 24 (microseconds?): 3 writes took 17, 7 writes took 20 and 
19 writes took 24.

Correct me if I am wrong.

Thanks.
-Wei



 From: Brian Tarbox tar...@cabotresearch.com
To: user@cassandra.apache.org 
Sent: Tuesday, January 22, 2013 7:27 AM
Subject: Re: Is this how to read the output of nodetool cfhistograms?
 

Indeed, but how many Cassandra users have the good fortune to stumble across 
that page?  Just saying that the explanation of the very powerful nodetool 
commands should be more front and center.

Brian



On Tue, Jan 22, 2013 at 10:03 AM, Edward Capriolo edlinuxg...@gmail.com wrote:

This was described in good detail here:



http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/


On Tue, Jan 22, 2013 at 9:41 AM, Brian Tarbox tar...@cabotresearch.com wrote:

Thank you!   Since this is a very non-standard way to display data it might be 
worth a better explanation in the various online documentation sets.


Thank you again.


Brian



On Tue, Jan 22, 2013 at 9:19 AM, Mina Naguib mina.nag...@adgear.com wrote:



On 2013-01-22, at 8:59 AM, Brian Tarbox tar...@cabotresearch.com wrote:

 The output of this command seems to make no sense unless I think of it as 
 5 completely separate histograms that just happen to be displayed together.

 Using this example output should I read it as: my reads all took either 1 
 or 2 sstable.  And separately, I had write latencies of 3,7,19.  And 
 separately I had read latencies of 2, 8,69, etc?

 In other words...each row isn't really a row...i.e. on those 16033 reads 
 from a single SSTable I didn't have 0 write latency, 0 read latency, 0 row 
 size and 0 column count.  Is that right?

Correct.  A number in any of the metric columns is a count value bucketed in 
the offset on that row.  There are no relationships between other columns on 
the same row.

So your first row says 16033 reads were satisfied by 1 sstable.  The other 
metrics (for example, latency of these reads) are reflected in the histogram 
under Read Latency, under various other bucketed offsets.
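
One way to make that reading concrete is to treat each metric column as its 
own offset-to-count map. The sketch below does that for the non-zero cells of 
the sample output quoted in this thread. The dictionary and function names are 
made up for illustration; nothing here calls nodetool.

```python
# Counts copied from the sample cfhistograms output quoted in this thread.
# Keys are the Offset bucket, values are how many times it was recorded.
sstables_per_read = {1: 16033, 2: 303}
write_latency = {17: 3, 20: 7, 24: 19}
read_latency = {8: 2, 12: 2, 14: 8, 17: 69, 20: 163, 24: 1369}

def shares(column):
    """Return the fraction of recorded events that fell in each offset bucket."""
    total = sum(column.values())
    return {offset: count / total for offset, count in sorted(column.items())}

# Each column is summarized independently; rows are never joined across columns.
for name, column in [("SSTables per read", sstables_per_read),
                     ("Write latency", write_latency),
                     ("Read latency", read_latency)]:
    print(name, "total events:", sum(column.values()))
    for offset, share in shares(column).items():
        print("  offset %d: %.1f%%" % (offset, 100 * share))
```

Note that the totals differ per column (29 writes versus 1613 reads here), 
which is another sign that the columns are separate histograms.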



 Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
 1          16033              0             0         0             0
 2            303              0             0         0             1
 3              0              0             0         0             0
 4              0              0             0         0             0
 5              0              0             0         0             0
 6              0              0             0         0             0
 7              0              0             0         0             0
 8              0              0             2         0             0
 10             0              0             0         0          6261
 12             0              0             2         0           117
 14             0              0             8         0             0
 17             0              3            69         0           255
 20             0              7           163         0             0
 24             0             19          1369         0             0

Re: Cassandra pending compaction tasks keeps increasing

2013-01-22 Thread Wei Zhu

Thanks Aaron and Jim for your reply. The data import is done. We have about 
135G on each node in about 28K SSTables. For normal operation, we only 
have about 90 writes per second, but when I run nodetool compactionstats, the pending count 
remains at 9 and hardly changes. I guess it's just an estimated number.
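
As a sanity check, the 28K figure is about what 135G of data works out to at 
the 5 MB SSTable size mentioned elsewhere in this thread. A rough sketch, 
assuming every SSTable is exactly full and taking 1 GB as 1024 MB (the 
function name and the 10 MB alternative are hypothetical):

```python
def estimated_sstable_count(data_size_gb, sstable_size_mb):
    """Rough SSTable count, assuming every SSTable is exactly full."""
    return data_size_gb * 1024 // sstable_size_mb

print(estimated_sstable_count(135, 5))   # 5 MB SSTables, as used in this thread
print(estimated_sstable_count(135, 10))  # hypothetical 10 MB SSTables
```

Doubling the SSTable size roughly halves the file count for the same data.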

When I ran histogram,

Offset  SSTables  Write Latency  Read Latency  Row Size  Column Count
1           2644              0             0         0      18660057
2           8204              0             0         0       9824270
3          11198              0             0         0       6968475
4           4269              6             0         0       5510745
5            517             29             0         0       4595205


You can see that close to half of the reads touch 3 SSTables. The majority of read 
latencies are under 5ms; only a dozen are over 10ms. We haven't fully turned on 
reads yet, only 60 reads per second. We have seen about 20 read timeouts during the 
past 12 hours. Not a single warning in the Cassandra log. 

Is it normal for Cassandra to timeout some requests? We set rpc timeout to be 
1s, it shouldn't time out any of them?
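
For reference, the SSTables column of the histogram can be turned into shares 
of reads. This is a small sketch using only the five rows pasted in this 
message; it assumes there are no further non-zero rows, and the names are 
illustrative.

```python
# SSTables-per-read counts from the histogram rows shown in this message.
sstables_per_read = {1: 2644, 2: 8204, 3: 11198, 4: 4269, 5: 517}

total_reads = sum(sstables_per_read.values())

# Share of reads that touched exactly N SSTables.
share = {n: count / total_reads for n, count in sstables_per_read.items()}

# Share of reads that needed more than one SSTable.
multi_sstable_share = 1 - share[1]

print("reads counted:", total_reads)
for n in sorted(share):
    print("%d SSTable(s): %.1f%%" % (n, 100 * share[n]))
print("more than one SSTable: %.1f%%" % (100 * multi_sstable_share))
```

With LCS keeping up, most reads would be expected in the first bucket, so a 
distribution centered on 3 is consistent with compaction being behind.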

Thanks.
-Wei



 From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org 
Sent: Monday, January 21, 2013 12:21 AM
Subject: Re: Cassandra pending compaction tasks keeps increasing
 

The main guarantee LCS gives you is that most reads will only touch 1 row 
http://www.datastax.com/dev/blog/when-to-use-leveled-compaction

If compaction is falling behind this may not hold.

nodetool cfhistograms tells you how many SSTables were read from for reads.  
It's a recent histogram that resets each time you read from it. 

Also, parallel levelled compaction in 1.2 
http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2

Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/01/2013, at 7:49 AM, Jim Cistaro jcist...@netflix.com wrote:

1) In addition to iostat, dstat is a good tool to see what kind of disk 
throughput you are getting.  That would be one thing to monitor.
2) For LCS, we also see pending compactions skyrocket.  During load, LCS will 
create a lot of small sstables which will queue up for compaction.
3) For us the biggest concern is not how high the pending count gets, but how 
often it gets back down near zero.  If your load is something you can do in 
segments or pause, then you can see how fast the cluster recovers on the 
compactions.
4) One thing which we tune per cluster is the size of the files.  Increasing 
this from 5MB can sometimes improve things.  But I forget if we have ever 
changed this after starting data load.
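
Point 3 above can be sketched as a small check over periodic samples of the 
pending count. This assumes the nodetool compactionstats output contains a 
line of the form "pending tasks: N"; that format, the threshold and the 
function names are assumptions for illustration, so adjust them to what your 
version actually prints.

```python
import re

def parse_pending_tasks(compactionstats_output):
    """Extract the pending task count from captured compactionstats text."""
    match = re.search(r"pending tasks:\s*(\d+)", compactionstats_output)
    if match is None:
        raise ValueError("no 'pending tasks' line found")
    return int(match.group(1))

def recovers(samples, threshold=10):
    """True if the pending count ever drops to the threshold or below."""
    return any(sample <= threshold for sample in samples)

# Hypothetical samples taken while the load is paused.
outputs = ["pending tasks: 15000\n", "pending tasks: 4200\n", "pending tasks: 9\n"]
samples = [parse_pending_tasks(text) for text in outputs]
print(samples, recovers(samples))
```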


Is your cluster receiving read traffic during this data migration? If so, I 
would say that read latency is your best measure.  If the high number of 
SSTables waiting to compact is not hurting your reads, then you are probably 
ok.  Since you are on SSD, there is a good chance the compactions are not 
hurting you.  As for compactionthroughput, we set ours high for SSD.  You 
usually won't use it all because the compactions are usually single threaded.  
Dstat will help you measure this.


I hope this helps,
jc

From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org user@cassandra.apache.org, Wei Zhu 
wz1...@yahoo.com
Date: Friday, January 18, 2013 12:10 PM
To: Cassandr usergroup user@cassandra.apache.org
Subject: Cassandra pending compaction tasks keeps increasing



Hi,
When I run nodetool compactionstats

I see the number of pending tasks going up steadily. 

I tried to increase the compaction throughput by using

nodetool setcompactionthroughput

I even tried the extreme of setting it to 0 to disable the throttling. 

I checked iostat. We have SSDs for data and disk utilization is less than 5%, 
which means it's not I/O bound. CPU is also less than 10%.

We are using leveled compaction and are in the process of migrating data. We 
have 4500 writes per second and very few reads. We have about 70G of data now, 
which will grow to 150G when the migration finishes. We only have one CF and 
right now the number of SSTables is around 15000. Write latency is still under 
0.1ms. 

Is there anything to be concerned about? Or anything I can do to reduce the 
number of pending compactions?


Thanks.
-Wei

Cassandra pending compaction tasks keeps increasing

2013-01-18 Thread Wei Zhu

Hi,
When I run nodetool compactionstats

I see the number of pending tasks going up steadily. 

I tried to increase the compaction throughput by using

nodetool setcompactionthroughput

I even tried the extreme of setting it to 0 to disable the throttling. 

I checked iostat. We have SSDs for data and disk utilization is less than 5%, 
which means it's not I/O bound. CPU is also less than 10%.

We are using leveled compaction and are in the process of migrating data. We 
have 4500 writes per second and very few reads. We have about 70G of data now, 
which will grow to 150G when the migration finishes. We only have one CF and 
right now the number of SSTables is around 15000. Write latency is still under 
0.1ms. 

Is there anything to be concerned about? Or anything I can do to reduce the 
number of pending compactions?

Thanks.
-Wei

Re: How many BATCH inserts in to many?

2013-01-13 Thread Wei Zhu

Another potential issue is when some of the mutations fail. 
Are atomic batches in 1.2 designed to resolve this?

http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2

-Wei

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Sunday, January 13, 2013 7:57:56 PM
Subject: Re: How many BATCH inserts in to many?

With regard to a large number of records in a batch mutation there are some 
potential issues. 

Each row becomes a task in the write thread pool on each replica. If a single 
client sends 1,000 rows in a mutation it will take time for the (default) 32 
threads in the write pool to work through the mutations. While they are doing 
this other clients / requests will appear to be starved / stalled. 

There are also issues with the max message size in thrift and cql over thrift. 

IMHO, as a rule of thumb, don't go over a few hundred rows if you have a high number of 
concurrent writers. 
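
A minimal sketch of that rule of thumb: split a large set of rows into 
smaller batches before sending them. The batch size of 200 and the helper 
name are arbitrary illustrations, and the actual client call that would send 
each batch is left out on purpose. 

```python
def chunked(rows, batch_size=200):
    """Yield consecutive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

# 1,000 hypothetical rows, as in the example above.
rows = [("key%d" % i, {"col": i}) for i in range(1000)]
batches = list(chunked(rows))
print(len(batches), "batches of up to", len(batches[0]), "rows")
```

Each smaller batch occupies the write pool for less time, so other clients 
get a turn between batches. 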

Cheers 

- 
Aaron Morton 
Freelance Cassandra Developer 
New Zealand 

@aaronmorton 
http://www.thelastpickle.com 

On 14/01/2013, at 12:56 AM, Radim Kolar  h...@filez.com  wrote: 

Do not use Cassandra to implement a high-throughput queueing system. It 
does not scale because of tombstone management. Use HornetQ. It's an amazingly fast 
broker, but its persistence is quite slow if you want to create queues 
significantly larger than your memory and use selectors to search for 
specific messages in them. 

My point is that for implementing a queue, a message broker is what you want.

Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Wei Zhu

I tried to register and got the following page, but I haven't received the email 
yet. I registered 10 minutes ago.

Thank you for registering to attend:

Is My App a Good Fit for Apache Cassandra?

Details about this webinar have also been sent to your email, including a link 
to the webinar's URL.


Webinar Description:

Join Eric Lubow, CTO of Simple Reach and DataStax MVP for Apache Cassandra 
as he examines the types of applications that are suited to be built on 
top of Cassandra. Eric will talk about the key considerations for 
designing and deploying your application on Apache Cassandra. 

How come it says "Is My App a Good Fit for Apache Cassandra?", which was the 
previous webinar? 

Thanks.
-Wei



 From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org user@cassandra.apache.org 
Sent: Thursday, December 13, 2012 7:23 AM
Subject: Re: Datastax C*ollege Credit Webinar Series : Create your first Java 
App w/ Cassandra
 

It should be good stuff. Brian eats this stuff for lunch.

On Wednesday, December 12, 2012, Brian O'Neill b...@alumni.brown.edu wrote:
 FWIW --
 I'm presenting tomorrow for the Datastax C*ollege Credit Webinar Series:
 http://brianoneill.blogspot.com/2012/12/presenting-for-datastax-college-credit.html

 I hope to make CQL part of the presentation and show how it integrates
 with the Java APIs.
 If you are interested, drop in.

 -brian

 --
 Brian ONeill
 Lead Architect, Health Market Science (http://healthmarketscience.com)
 mobile:215.588.6024
 blog: http://brianoneill.blogspot.com/
 twitter: @boneill42

Re: Datastax C*ollege Credit Webinar Series : Create your first Java App w/ Cassandra

2012-12-13 Thread Wei Zhu

Never mind, the email arrived after 15 minutes or so...


multiget_slice SlicePredicate

2012-12-10 Thread Wei Zhu

I know it's probably not a good idea to use multiget, but for my use case, it's 
the only choice,

I have question regarding the SlicePredicate argument of the multiget_slice


The SlicePredicate takes slice_range which takes start, end and range. I 
suppose start and end will apply to each individual row. How about range, is it 
a cumulative column count across all the rows or a count for each individual row? 
If I set range to 100, is it 100 columns per row, or total?

Thanks for your reply,
-Wei


multiget_slice
* 
map<string,list<ColumnOrSuperColumn>> multiget_slice(list<binary> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level)

Re: multiget_slice SlicePredicate

2012-12-10 Thread Wei Zhu

Well, I'm not sure how parallel multiget is. Some say the requests are sent in 
parallel to the different nodes and executed sequentially on each node. 
I haven't looked into the source code yet. Does anyone know 
for sure?

I am using Hector, just copied the thrift definition from Cassandra site for 
reference.

You are right, the count is for each individual row.

Thanks.
-Wei 
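
To illustrate the conclusion above (the limit applies to each row, not to the whole result), here is a minimal pure-Python model. It does not call Thrift or Hector; the function name multiget_slice_model and the sample data are made up for illustration, and the limit is named count here, which is the term used in the reply above.

```python
from typing import Dict, List, Tuple

Row = List[Tuple[str, str]]  # (column name, value) pairs


def multiget_slice_model(rows: Dict[str, Row], keys: List[str],
                         start: str, finish: str, count: int) -> Dict[str, Row]:
    """Toy model: apply the same slice range to every requested row independently."""
    result = {}
    for key in keys:
        columns = sorted(rows.get(key, []))
        # An empty start or finish is treated as "unbounded" in this model.
        in_range = [
            (name, value) for name, value in columns
            if (start == "" or name >= start) and (finish == "" or name <= finish)
        ]
        result[key] = in_range[:count]  # limit is applied per row, not across rows
    return result


# Hypothetical data: one wide row and one narrow row.
rows = {
    "row1": [("c%03d" % i, "v") for i in range(250)],
    "row2": [("c%03d" % i, "v") for i in range(40)],
}
sliced = multiget_slice_model(rows, ["row1", "row2"], start="", finish="", count=100)
print({key: len(cols) for key, cols in sliced.items()})
```

With a limit of 100, the wide row is cut to 100 columns while the narrow row still returns all 40, so the total can exceed 100.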



 From: Hiller, Dean dean.hil...@nrel.gov
To: user@cassandra.apache.org user@cassandra.apache.org; Wei Zhu 
wz1...@yahoo.com 
Sent: Monday, December 10, 2012 1:13 PM
Subject: Re: multiget_slice SlicePredicate
 
What's wrong with multiget…parallel performance is great from multiple disks 
and so usually that is a good thing.

Also, something looks wrong, since you have list<binary> keys, I would expect 
the Map to be Map<binary, list<ColumnOrSuperColumn>>

Are you sure you have that correct?  If you set range to 100, it should be 100 
columns each row but it never hurts to run the code and verify.

Later,
Dean
PlayOrm Developer


From: Wei Zhu wz1...@yahoo.com
Reply-To: user@cassandra.apache.org, Wei Zhu wz1...@yahoo.com
Date: Monday, December 10, 2012 2:07 PM
To: Cassandra usergroup user@cassandra.apache.org
Subject: multiget_slice SlicePredicate

I know it's probably not a good idea to use multiget, but for my use case, it's 
the only choice,

I have question regarding the SlicePredicate argument of the multiget_slice


The SlicePredicate takes slice_range which takes start, end and range. I 
suppose start and end will apply to each individual row. How about range, is it 
a cumulative column count across all the rows or a count for each individual row?
If I set range to 100, is it 100 columns per row, or total?

Thanks for your reply,
-Wei

multiget_slice

*
map<string,list<ColumnOrSuperColumn>> multiget_slice(list<binary> keys, 
ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel 
consistency_level)

Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered compaction.

2012-12-06 Thread Wei Zhu

I think Aaron meant 300-400GB instead of 300-400MB.

Thanks.
-Wei

- Original Message -
From: Wade L Poziombka wade.l.poziom...@intel.com
To: user@cassandra.apache.org
Sent: Thursday, December 6, 2012 6:53:53 AM
Subject: RE: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered 
compaction.

“ Having so much data on each node is a potential bad day.” 

Is this discussed somewhere on the Cassandra documentation (limits, practices 
etc)? We are also trying to load up quite a lot of data and have hit memory 
issues (bloom filter etc.) in 1.0.10. I would like to read up on big data usage 
of Cassandra. Meaning terabyte size databases. 

I do get your point about the amount of time required to recover downed node. 
But this 300-400MB business is interesting to me. 

Thanks in advance. 

Wade 

From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: Wednesday, December 05, 2012 9:23 PM 
To: user@cassandra.apache.org 
Subject: Re: Freeing up disk space on Cassandra 1.1.5 with Size-Tiered 
compaction. 

Basically we were successful on two of the nodes. They both took ~2 days and 11 
hours to complete and at the end we saw one very large file ~900GB and the rest 
much smaller (the overall size decreased). This is what we expected! 

I would recommend having up to 300MB to 400MB per node on a regular HDD with 
1GB networking. 

But on the 3rd node, we suspect major compaction didn't actually finish its 
job… 

The file list looks odd. Check the time stamps, on the files. You should not 
have files older than when compaction started. 

8GB heap 

The default max is 4GB nowadays. 

1) Do you expect problems with the 3rd node during 2 weeks more of operations, 
in the conditions seen below? 

I cannot answer that. 

2) Should we restart with leveled compaction next year? 

I would run some tests to see how it works for your workload. 

4) Should we consider increasing the cluster capacity? 

IMHO yes. 

You may also want to do some experiments with turning compression on if it is not 
already enabled. 

Having so much data on each node is a potential bad day. If instead you had to 
move or repair one of those nodes how long would it take for cassandra to 
stream all the data over ? (Or to rsync the data over.) How long does it take 
to run nodetool repair on the node ? 

With RF 3, if you lose a node you have lost your redundancy. It's important to 
have a plan about how to get it back and how long it may take. 
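
As a rough back-of-the-envelope sketch of the "how long would it take" question above, the following estimates the raw transfer time for a node's data over the network. It assumes the "1GB networking" mentioned earlier means a 1 Gbit/s link and uses the ~1.4TB per node figure from this thread; the 50% efficiency value is purely hypothetical, and real streaming or repair adds overhead this model ignores.

```python
def transfer_hours(data_bytes: float, link_bits_per_sec: float,
                   efficiency: float = 1.0) -> float:
    """Hours needed to move data_bytes over a link at the given efficiency (0..1]."""
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = data_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 3600


TB = 1e12    # decimal terabyte, in bytes
GBIT = 1e9   # 1 Gbit/s link, in bits per second

best_case = transfer_hours(1.4 * TB, GBIT)                   # full line rate
half_rate = transfer_hours(1.4 * TB, GBIT, efficiency=0.5)   # hypothetical 50% utilisation
print("best case: %.1f h, at 50%% efficiency: %.1f h" % (best_case, half_rate))
```

Even the best case is a matter of hours during which redundancy is reduced, which is the point being made above.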

Hope that helps. 

- 

Aaron Morton 

Freelance Cassandra Developer 

New Zealand 

@aaronmorton 

http://www.thelastpickle.com 

On 6/12/2012, at 3:40 AM, Alexandru Sicoe  adsi...@gmail.com  wrote: 

Hi guys, 
Sorry for the late follow-up but I waited to run major compactions on all 3 
nodes at a time before replying with my findings. 

Basically we were successful on two of the nodes. They both took ~2 days and 11 
hours to complete and at the end we saw one very large file ~900GB and the rest 
much smaller (the overall size decreased). This is what we expected! 

But on the 3rd node, we suspect major compaction didn't actually finish its 
job. First of all nodetool compact returned much earlier than the rest - after 
one day and 15 hrs. Secondly from the 1.4TBs initially on the node only about 
36GB were freed up (almost the same size as before). Saw nothing in the server 
log (debug not enabled). Below I pasted some more details about file sizes 
before and after compaction on this third node and disk occupancy. 

The situation is maybe not so dramatic for us because in less than 2 weeks we 
will have a down time till after the new year. During this we can completely 
delete all the data in the cluster and start fresh with TTLs for 1 month (as 
suggested by Aaron and 8GB heap as suggested by Alain - thanks). 

Questions: 

1) Do you expect problems with the 3rd node during 2 weeks more of operations, 
in the conditions seen below? 
[Note: we expect the minor compactions to continue building up files but never 
really getting to compacting the large file and thus not needing much 
temporarily extra disk space]. 

2) Should we restart with leveled compaction next year? 
[Note: Aaron was right, we have 1 week rows which get deleted after 1 month 
which means older rows end up in big files = to free up space with SizeTiered 
we will have no choice but run major compactions which we don't know if they 
will work provided that we get at ~1TB / node / 1 month. You can see we are at 
the limit!] 

3) In case we keep SizeTiered: 

- How can we improve the performance of our major compactions? (we left all 
config parameters as default). Would increasing compactions throughput 
interfere with writes and reads? What about multi-threaded compactions? 

- Do we still need to run regular repair operations as well? Do these also do a 
major compaction or are they completely separate operations? 

[Note:

Rename cluster

2012-11-29 Thread Wei Zhu

Hi,
I am trying to rename a cluster by following the instruction on Wiki:

Cassandra says "ClusterName mismatch: oldClusterName != newClusterName" and 
refuses to start

To prevent operator errors, Cassandra stores the name of the cluster in its 
system table. If you need to rename a cluster for some reason, you can: 

Perform these steps on each node: 
1. Start the cassandra-cli connected locally to this node. 
2. Run the following: 
   a. use system; 
   b. set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('new cluster name'); 
   c. exit; 
3. Run nodetool flush on this node. 
4. Update cluster_name in the cassandra.yaml file to match the name set in 
   2b. 
5. Restart the node. 

Once all nodes have had this operation performed and been restarted, 
nodetool ring should show all nodes as UP.

Get the following error:
Connected to: Test Cluster on 10.200.128.151/9160
Welcome to Cassandra CLI version 1.1.6

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use system;
Authenticated to keyspace: system
[default@system] set LocationInfo[utf8('L')][utf8('ClusterName')]=utf8('General Services Cluster');
system keyspace is not user-modifiable.
InvalidRequestException(why:system keyspace is not user-modifiable.)
at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15974)
at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:797)
at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:781)
at org.apache.cassandra.cli.CliClient.executeSet(CliClient.java:909)
at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:222)
at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:219)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:346)

I had to remove the data directory in order to change the cluster name. 
Luckily it's my testing box, so no harm done. Just wondering what changed 
so that the modification is no longer allowed through the CLI? How can the 
cluster name be changed now without wiping out all the data?

Thanks.
-Wei

Re: Java high-level client

2012-11-28 Thread Wei Zhu

We are using Hector now. What is the major advantage of astyanax over Hector?

Thanks.
-Wei

 From: Andrey Ilinykh ailin...@gmail.com
To: user@cassandra.apache.org 
Sent: Wednesday, November 28, 2012 9:37 AM
Subject: Re: Java high-level client

+1

On Tue, Nov 27, 2012 at 10:10 AM, Michael Kjellman mkjell...@barracuda.com 
wrote:

Netflix has a great client

https://github.com/Netflix/astyanax

Re: Java high-level client

2012-11-28 Thread Wei Zhu

Astyanax was the son of Hector, who was Cassandra's brother in Greek mythology.

So son is doing better than the father:)

-Wei




 From: Michael Kjellman mkjell...@barracuda.com
To: user@cassandra.apache.org user@cassandra.apache.org 
Sent: Wednesday, November 28, 2012 11:51 AM
Subject: Re: Java high-level client
 

Lots of example code, a nice API, and good performance are the first things that come 
to mind for why I like Astyanax better than Hector.
From:  Andrey Ilinykh ailin...@gmail.com
Reply-To:  user@cassandra.apache.org user@cassandra.apache.org
Date:  Wednesday, November 28, 2012 11:49 AM
To:  user@cassandra.apache.org user@cassandra.apache.org, Wei Zhu 
wz1...@yahoo.com
Subject:  Re: Java high-level client


First of all, it is backed by Netflix. They have used it in production for a long time, 
so it is pretty solid. They also have a nice tool (Priam) which makes Cassandra 
cloud (AWS) friendly. This is important for us. 

Andrey 



On Wed, Nov 28, 2012 at 11:53 AM, Wei Zhu wz1...@yahoo.com wrote:

We are using Hector now. What is the major advantage of astyanax over Hector?


Thanks.
-Wei




From: Andrey Ilinykh ailin...@gmail.com
To: user@cassandra.apache.org 
Sent: Wednesday, November 28, 2012 9:37 AM 

Subject: Re: Java high-level client



+1 



On Tue, Nov 27, 2012 at 10:10 AM, Michael Kjellman mkjell...@barracuda.com 
wrote:

Netflix has a great client

https://github.com/Netflix/astyanax







-- 
'Like' us on Facebook for exclusive content and other resources on all 
Barracuda Networks solutions.
Visit http://barracudanetworks.com/facebook

Re: Java high-level client

2012-11-27 Thread Wei Zhu

FYI,
We are using Hector 1.0-5 which comes with cassandra-thrift 1.09 - libthrift 
0.6.1. It can work with Cassandra 1.1.6.
Totally agree it's a pain to deal with different versions of libthrift. We use 
Scribe for logging, and it's a bit messy over there.

Thanks.
-Wei



 From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org 
Sent: Tuesday, November 27, 2012 8:57 AM
Subject: Re: Java high-level client
 

Hector does not require an outdated version of Thrift; you are likely using an 
outdated version of Hector. 

Here is the long and short of it: if the Thrift API changes then Hector 
can have compatibility issues. This happens from time to time. The main methods 
like get() and insert() have remained the same, but the CFMetaData objects have 
changed (this causes the incompatible class stuff you are seeing).


CQL has a different version of the same problem: the CQL syntax is versioned. 
For example, if you try to execute a CQL3 query as a CQL2 query it will likely 
fail. 

In the end your code still has to be version aware. With hector you get a 
compile time problem, with pure CQL you get a runtime problem.

I have always had the opinion the project should have shipped hector with 
Cassandra, this would have made it obvious what version is likely to work. 

The new CQL transport client is not being shipped with Cassandra either, so you 
will still have to match up the versions. Although they should be largely 
compatible, at some time in the near or far future one of the clients probably won't 
work with one of the servers.

Edward


On Tue, Nov 27, 2012 at 11:10 AM, Michael Kjellman mkjell...@barracuda.com 
wrote:

Netflix has a great client

https://github.com/Netflix/astyanax

On 11/27/12 7:40 AM, Peter Lin wool...@gmail.com wrote:

I use hector-client master, which is pretty stable right now.

It uses the latest thrift, so you can use hector with thrift 0.9.0.
That's assuming you don't mind using the active development branch.



On Tue, Nov 27, 2012 at 10:36 AM, Carsten Schnober
schno...@ids-mannheim.de wrote:
 Hi,
 I'm aware that this has been a frequent question, but answers are still
 hard to find: what's an appropriate Java high-level client?
 I actually believe that the lack of a single maintained Java API that is
 packaged with Cassandra is quite an issue. The way the situation is
 right now, new users have to pick more or less randomly one of the
 available options from the Cassandra Wiki and find a suitable solution
 for their individual requirements through trial implementations. This
 can cause a lot of wasted time (and frustration).

 Personally, I've played with Hector before figuring out that it seems to
 require an outdated Thrift version. Downgrading to Thrift 0.6 is not an
 option for me though because I use Thrift 0.9.0 in other classes of the
 same project.
 So I've had a look at Kundera and at Easy-Cassandra. Both seem to lack
 real documentation beyond the examples available in their Github
 repositories, right?

 Can more experienced users recommend either one of the two or some of
 the other options listed at the Cassandra Wiki? I know that this
 strongly depends on individual requirements, but all I need are simple
 requests for very basic queries. So I would like to emphasize the
 importance of clear documentation and a stable and well-maintained API.
 Any hints?
 Thanks!
 Carsten

 --
 Institut für Deutsche Sprache | http://www.ids-mannheim.de
 Projekt KorAP                 | http://korap.ids-mannheim.de
 Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
 Korpusanalyseplattform der nächsten Generation
 Next Generation Corpus Analysis Platform



Re: row cache re-fill very slow

2012-11-19 Thread Wei Zhu

Last time I checked, it took about 120 seconds to load 21125 keys totaling 
about 500MB in memory (we have pretty wide rows :). So it's about 4 MB/sec.

Just curious, Andras, how do you manage such a big row cache (10-15GB 
currently)? The recommendation is to use 10% of your heap as row cache, so 
is your heap over 100GB? The largest heap DataStax recommends is 8GB, and it seems to 
be a hardcoded limit in cassandra-env.sh ( # calculate 1/4 ram and cap to 
8192MB). Does your GC hold up with such a big heap? In my experience, a full GC 
could take over 20 seconds for such a big heap.
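
The two estimates in this message can be checked with a few lines of arithmetic. This is only a sketch: the 500MB, 120 seconds and 10-15GB figures come from this thread, the 10% rule is the recommendation as quoted above, and the function names are made up.

```python
def refill_rate_mb_per_sec(cache_mb: float, seconds: float) -> float:
    """Average row cache re-fill throughput."""
    return cache_mb / seconds


def heap_implied_gb(row_cache_gb: float, cache_fraction: float = 0.10) -> float:
    """Heap size implied if the row cache is cache_fraction of the heap."""
    return row_cache_gb / cache_fraction


rate = refill_rate_mb_per_sec(500, 120)
heap_low, heap_high = heap_implied_gb(10), heap_implied_gb(15)
print("re-fill rate: %.1f MB/sec" % rate)
print("implied heap: %.0f-%.0f GB" % (heap_low, heap_high))
```

This assumes the cache lives on the heap; an off-heap cache provider would not be bound by the same rule.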

Thanks.
-Wei



 From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org 
Sent: Monday, November 19, 2012 1:00 PM
Subject: Re: row cache re-fill very slow
 

i was just wondering if anyone else is experiencing very slow ( ~ 3.5 MB/sec ) 
re-fill of the row cache at start up.
It was mentioned the other day.  

What version are you on ? 
Do you know how many rows were loaded ? When complete it will log a message 
with the pattern 

completed loading (%d ms; %d keys) row cache for %s.%s

How is the saved row cache file processed?
In Version 1.1, after the SSTables have been opened the keys in the saved row 
cache are read one at a time and the whole row read into memory. This is a 
single threaded operation. 

In 1.2 reading the saved cache is still single threaded, but reading the rows 
goes through the read thread pool so is in parallel.
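
The difference described above (keys read one at a time versus row reads handed to a thread pool) can be sketched with a toy model. This is not Cassandra code: read_row just sleeps for a hypothetical 10 ms to stand in for a disk seek, and the pool size of 8 is an arbitrary assumption.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def read_row(key):
    """Stand-in for reading one row from disk (hypothetical 10 ms latency)."""
    time.sleep(0.01)
    return key, "row-for-%s" % key


def load_sequential(keys):
    """1.1-style model: one key at a time on a single thread."""
    return dict(read_row(key) for key in keys)


def load_parallel(keys, workers=8):
    """1.2-style model: keys are still iterated in order, rows are read by a pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(read_row, keys))


keys = ["key%d" % i for i in range(40)]

started = time.perf_counter()
sequential = load_sequential(keys)
sequential_secs = time.perf_counter() - started

started = time.perf_counter()
parallel = load_parallel(keys)
parallel_secs = time.perf_counter() - started

print("sequential: %.2fs, parallel: %.2fs" % (sequential_secs, parallel_secs))
```

Both loaders end up with the same cache contents; only the wall-clock time differs, and on real disks the gain depends on how many concurrent seeks the hardware can serve.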

In both cases I do not believe the cache is stored in token (or key) order. 

( Admittedly whatever is going on is still much more preferable to starting 
with a cold row cache )
row_cache_keys_to_save in yaml may help you find a happy half way point. 

Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 20/11/2012, at 3:17 AM, Andras Szerdahelyi 
andras.szerdahe...@ignitionone.com wrote:

Hey list, 


i was just wondering if anyone else is experiencing very slow ( ~ 3.5 MB/sec ) 
re-fill of the row cache at start up. We operate with a large row cache ( 
10-15GB currently ) and we already measure startup times in hours :-)


How is the saved row cache file processed? Are the cached row keys simply 
iterated over and their respective rows read from SSTables - possibly creating 
random reads with small enough sstable files, if the keys were not stored in a 
manner optimised for a quick re-fill ? -  or is there a smarter algorithm ( 
i.e. scan through one sstable at a time, filter rows that should be in row 
cache )  at work and this operation is purely disk i/o bound ?


( Admittedly whatever is going on is still much more preferable to starting 
with a cold row cache )


thanks!
Andras





Andras Szerdahelyi
Solutions Architect, IgnitionOne | 1831 Diegem E.Mommaertslaan 20A
M: +32 493 05 50 88 | Skype: sandrew84




Re: unable to read saved rowcache from disk

2012-11-15 Thread Wei Zhu

Just curious, why do you think the row key will take 300 bytes? If the row key is a 
Long type, doesn't it take 8 bytes?
In his case, the row cache was 500MB with 1.6M rows, so the row data is about 300B per row. Did 
I miss something?

Thanks.
-Wei



 From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org 
Sent: Thursday, November 15, 2012 12:15 PM
Subject: Re: unable to read saved rowcache from disk
 

For a row cache of 1,650,000:

16 byte token
300 byte row key ? 
and row data ? 
multiply by a Java fudge factor of 5 or 10. 

Try deleting the saved cache and restarting.
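
A quick worked version of the estimate above, as a sketch only. The 16-byte token, the 1,650,000 rows and the fudge factors are from this message; the key and row sizes carry question marks in the original, so the 8-byte key and 300-byte row used for the low end are assumptions taken from elsewhere in the thread.

```python
def row_cache_memory_bytes(rows: int, key_bytes: int, row_bytes: int,
                           token_bytes: int = 16, fudge: int = 5) -> int:
    """Rough memory needed to hold the row cache, including a JVM overhead factor."""
    return rows * (token_bytes + key_bytes + row_bytes) * fudge


ROWS = 1_650_000

# Low end: 8-byte Long key, 300-byte rows, fudge factor 5 (assumed values).
low = row_cache_memory_bytes(ROWS, key_bytes=8, row_bytes=300, fudge=5)
# High end: 300-byte key, 300-byte rows, fudge factor 10 (assumed values).
high = row_cache_memory_bytes(ROWS, key_bytes=300, row_bytes=300, fudge=10)

print("low: %.2f GB, high: %.2f GB" % (low / 1e9, high / 1e9))
```

Even the low end is close to the 3G heap mentioned in the thread, which would be consistent with running out of memory partway through the load.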

Cheers
 



-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 15/11/2012, at 8:20 PM, Wz1975 wz1...@yahoo.com wrote:

Before shutdown, you saw the row cache had 500MB and 1.6M rows, an average of 
300B per row, so 700K rows should be a little over 200MB, unless it is reading more, 
maybe tombstones? Or the rows on disk have grown for some reason, but the row 
cache was not updated? Something else could be eating up the memory. You may want to 
profile memory and see what consumes it. 


Thanks.
-Wei

Sent from my Samsung smartphone on ATT 


 Original message 
Subject: Re: unable to read saved rowcache from disk 
From: Manu Zhang owenzhang1...@gmail.com 
To: user@cassandra.apache.org 
CC: 


3G, other jvm parameters are unchanged. 



On Thu, Nov 15, 2012 at 2:40 PM, Wz1975 wz1...@yahoo.com wrote:

How big is your heap?  Did you change the jvm parameter? 



Thanks.
-Wei



 Original message 
Subject: Re: unable to read saved rowcache from disk 
From: Manu Zhang owenzhang1...@gmail.com 
To: user@cassandra.apache.org 
CC: 


add a counter and print out myself



On Thu, Nov 15, 2012 at 1:51 PM, Wz1975 wz1...@yahoo.com wrote:

Curious where did you see this? 


Thanks.
-Wei




 Original message 
Subject: Re: unable to read saved rowcache from disk 

From: Manu Zhang owenzhang1...@gmail.com 
To: user@cassandra.apache.org 
CC: 


OOM at deserializing 747321th row



On Thu, Nov 15, 2012 at 9:08 AM, Manu Zhang owenzhang1...@gmail.com wrote:

oh, as for the number of rows, it's 165. How long would you expect it to 
be read back?



On Thu, Nov 15, 2012 at 3:57 AM, Wei Zhu wz1...@yahoo.com wrote:

Good information Edward. 
In my case, we have a good amount of RAM (76G) and the heap is 8G. So I set 
the row cache to 800M as recommended. Our columns are kind of big, so the 
hit ratio for the row cache is around 20%, so according to DataStax, we might 
just turn the row cache off altogether. 
Anyway, on restart, it took about 2 minutes to load the row cache:


 INFO [main] 2012-11-14 11:43:29,810 AutoSavingCache.java (line 108) reading saved cache /var/lib/cassandra/saved_caches/XXX-f2-RowCache
 INFO [main] 2012-11-14 11:45:12,612 ColumnFamilyStore.java (line 451) completed loading (102801 ms; 21125 keys) row cache for XXX.f2 


Just for comparison, our key is a long, and the disk usage for the row cache is 
253K (only the keys are stored when the row cache is saved to disk, so 253KB / 
8 bytes = 31625 keys). It's about right...
So for 15MB, there could be a lot of narrow rows (if the key is a Long, there 
could be more than 1M rows).
  
Thanks.
-Wei


 From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org 
Sent: Tuesday, November 13, 2012 11:13 PM
Subject: Re: unable to read saved rowcache from disk
 

http://wiki.apache.org/cassandra/LargeDataSetConsiderations

A negative side-effect of a large row-cache is start-up time. The
periodic saving of the row cache information only saves the keys that
are cached; the data has to be pre-fetched on start-up. On a large
data set, this is probably going to be seek-bound and the time it
takes to warm up the row cache will be linear with respect to the row
cache size (assuming sufficiently large amounts of data that the seek
bound I/O is not subject to optimization by disks)

Assuming a row cache 15MB and the average row is 300 bytes, that could
be 50,000 entries. 4 hours seems like a long time to read back 50K
entries. Unless the source table was very large and you can only do a
small number / reads/sec.
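
The estimate in the quoted paragraph can be spelled out as follows. This is a sketch using only the figures given there (15MB cache, 300-byte average row, 4 hours); note that other messages in this thread suggest the 15MB file holds only keys, in which case the entry count would be much higher.

```python
def cached_entries(cache_bytes: int, avg_row_bytes: int) -> int:
    """Number of entries if the cache size is divided evenly by the average row size."""
    return cache_bytes // avg_row_bytes


entries = cached_entries(15_000_000, 300)
reads_per_sec = entries / (4 * 3600)  # entries read back over 4 hours

print("entries: %d, implied rate: %.1f reads/sec" % (entries, reads_per_sec))
```

A rate of only a few reads per second is far below what a disk can seek, which is why 4 hours looks too long for 50K entries.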

On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com 
wrote:
 incorrect... what do you mean? I think it's only 15MB, which is not 
 big.


 On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Yes, the row cache could be incorrect, so on startup Cassandra verifies the
 saved row cache by re-reading it. It takes a long time, so do not save a big row
 cache.


 On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com 
 wrote:
  I have a rowcache provided by SerializingCacheProvider.
  The data that has been read into it is about 500MB, as claimed by
  jconsole. After saving cache, it is around 15MB on disk. Hence, I

Re: unable to read saved rowcache from disk

2012-11-14 Thread Wei Zhu

Good information Edward. 
In my case, we have a good amount of RAM (76G) and the heap is 8G. So I set the 
row cache to 800M as recommended. Our columns are kind of big, so the hit 
ratio for the row cache is around 20%, so according to DataStax, we might just turn 
the row cache off altogether. 
Anyway, on restart, it took about 2 minutes to load the row cache:

 INFO [main] 2012-11-14 11:43:29,810 AutoSavingCache.java (line 108) reading saved cache /var/lib/cassandra/saved_caches/XXX-f2-RowCache
 INFO [main] 2012-11-14 11:45:12,612 ColumnFamilyStore.java (line 451) completed loading (102801 ms; 21125 keys) row cache for XXX.f2 

Just for comparison, our key is a long, and the disk usage for the row cache is 253K 
(only the keys are stored when the row cache is saved to disk, so 253KB / 8 bytes = 31625 
keys). It's about right...
So for 15MB, there could be a lot of narrow rows (if the key is a Long, there could 
be more than 1M rows).
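
A small sketch of the key-count arithmetic above. It assumes the saved cache file contains nothing but 8-byte keys and treats 253K as 253,000 bytes; the gap between the estimate and the 21125 keys in the log suggests some per-key overhead, but the actual file layout is not confirmed here.

```python
def keys_in_saved_cache(file_bytes: int, key_bytes: int = 8) -> int:
    """Upper bound on the number of keys if the file held only fixed-size keys."""
    return file_bytes // key_bytes


estimated = keys_in_saved_cache(253_000)
logged = 21125                      # from the "completed loading" log line
bytes_per_logged_key = 253_000 / logged

keys_in_15mb = keys_in_saved_cache(15_000_000)

print("estimated keys: %d, logged keys: %d" % (estimated, logged))
print("bytes per logged key: %.1f" % bytes_per_logged_key)
print("keys in a 15MB file at 8 bytes each: %d" % keys_in_15mb)
```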
  
Thanks.
-Wei


 From: Edward Capriolo edlinuxg...@gmail.com
To: user@cassandra.apache.org 
Sent: Tuesday, November 13, 2012 11:13 PM
Subject: Re: unable to read saved rowcache from disk
 
http://wiki.apache.org/cassandra/LargeDataSetConsiderations

A negative side-effect of a large row-cache is start-up time. The
periodic saving of the row cache information only saves the keys that
are cached; the data has to be pre-fetched on start-up. On a large
data set, this is probably going to be seek-bound and the time it
takes to warm up the row cache will be linear with respect to the row
cache size (assuming sufficiently large amounts of data that the seek
bound I/O is not subject to optimization by disks)

Assuming a row cache 15MB and the average row is 300 bytes, that could
be 50,000 entries. 4 hours seems like a long time to read back 50K
entries. Unless the source table was very large and you can only do a
small number / reads/sec.

On Tue, Nov 13, 2012 at 9:47 PM, Manu Zhang owenzhang1...@gmail.com wrote:
 incorrect... what do you mean? I think it's only 15MB, which is not big.


 On Wed, Nov 14, 2012 at 10:38 AM, Edward Capriolo edlinuxg...@gmail.com
 wrote:

 Yes, the row cache could be incorrect, so on startup Cassandra verifies the
 saved row cache by re-reading it. It takes a long time, so do not save a big row
 cache.


 On Tuesday, November 13, 2012, Manu Zhang owenzhang1...@gmail.com wrote:
  I have a rowcache provided by SerializingCacheProvider.
  The data that has been read into it is about 500MB, as claimed by
  jconsole. After saving cache, it is around 15MB on disk. Hence, I suppose
  the size from jconsole is before serializing.
  Now while restarting Cassandra, it's unable to read saved rowcache back.
  By unable, I mean around 4 hours and I have to abort it and remove cache
  so as not to suspend other tasks.
  Since the data aren't huge, why can't Cassandra read it back?
  My Cassandra is 1.2.0-beta2.

Re: read request distribution

2012-11-13 Thread Wei Zhu

I am new to Cassandra, and 1.1.6 is the only version I have tested. I'm not sure 
about the old behavior. For 1.1.6, my observation is that for a brand new cluster 
(with no CF created), nodetool ring shows an Ownership column, and the value 
is 100/nodes. As soon as one CF is created, the column changes to Effective 
Ownership and the formula seems to be 100*replication factor/nodes, as Kirk 
mentioned. Theoretically, different keyspaces can have different replication 
factors. I'm not sure how Effective Ownership is calculated in that case. Just 
curious, does anyone know?

Thanks.
-Wei



 From: Kirk True k...@mustardgrain.com
To: user@cassandra.apache.org 
Sent: Monday, November 12, 2012 4:24 PM
Subject: Re: read request distribution
 

 
Somewhat recently the Ownership column was changed to Effective Ownership. 

 
Previously the formula was essentially 100/nodes. Now it's 100*replication 
factor/nodes. So in previous releases of Cassandra it would be 100/12 = 
8.33, now it would be closer to 25% (8.33*3 (assuming a replication factor of 
three)).
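
The two formulas above can be compared directly. This is a sketch of the arithmetic only, assuming evenly balanced tokens; capping the result at 100% (for the case where the replication factor is at least the node count) is an assumption added here, not something stated in the thread.

```python
def ownership_percent(nodes: int, replication_factor: int = 1) -> float:
    """Percent of the data a node holds, assuming evenly balanced tokens."""
    return min(100.0, 100.0 * replication_factor / nodes)


old_style = ownership_percent(12)                           # 100/nodes
effective = ownership_percent(12, replication_factor=3)     # 100*RF/nodes
three_nodes = ownership_percent(3, replication_factor=3)    # every node has everything

print("12 nodes: %.2f%% vs %.2f%% effective" % (old_style, effective))
print("3 nodes, RF=3: %.0f%%" % three_nodes)
```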

 
Kirk

 
On Mon, Nov 12, 2012, at 03:52 PM, Ananth Gundabattula wrote:

Hi all,

 
On an unrelated observation of the below readings, it looks like all the 3 
nodes own 100% of the data. This confuses me a bit. We have a 12 node cluster 
with RF=3 but the effective ownership is shown as 8.33 % . 

 
So here is my question: how is the ownership calculated? Is the replication factor 
considered in the ownership calculation? (If yes, then 8.33% ownership of 
a cluster seems wrong to me. If not, 100% ownership for a 3 node cluster seems 
wrong to me.) Am I missing something in the calculation? 

 
Regards,

Ananth

 
On Fri, Nov 9, 2012 at 4:37 PM, Wei Zhu wz1...@yahoo.com wrote:

Hi All,

I am doing a benchmark on Cassandra. I have a three node cluster with RF=3. 
I generated 6M rows with sequence numbers from 1 to 6M, so the rows should be 
evenly distributed among the three nodes, disregarding the replicas. 

I am doing a benchmark with read only requests, I generate read request for 
randomly generated keys from 1 to 6M. Oddly, nodetool cfstats, reports that 
one node has only half the requests as the other one and the third node sits 
in the middle. So the ratio is like 2:3:4. The node with the most read 
requests actually has the smallest latency and the one with the least read 
requests reports the largest latency. The difference is pretty big, the 
fastest is almost double the slowest.

All three nodes have exactly the same hardware and the data size on each 
node is the same since the RF is three and all of them have the complete 
data. I am using Hector as the client and the random read requests number in the 
millions. I can't think of a reasonable explanation.  Can someone please shed 
some light?

 
Thanks.
-Wei

Re: monitor cassandra 1.1.6 with MX4J

2012-11-12 Thread Wei Zhu

In my cassandra-env.sh for 1.1.6, there is no setting regarding mx4j at all. I 
simply dropped the mx4j jar into the lib folder and enabled JMX in 
cassandra-env.sh, and I can connect to the default mx4j port 8081 with no problem. 
I guess without the mx4j settings, it uses the default port. If you need to connect 
using another port, you might have to add the settings mentioned by Michal; I 
didn't try it myself though. 

Thanks.
-Wei



 From: Michal Michalski mich...@opera.com
To: user@cassandra.apache.org 
Sent: Monday, November 12, 2012 3:37 AM
Subject: Re: monitor cassandra 1.1.6 with MX4J
 
Hmm... It looks like it wasn't merged at some time (why?), because I can see 
that appropriate lines were present in a few branches. I didn't check if it 
works, but looking at git history tells me that you could try modifying 
cassandra-env.sh like this:

Add this somewhere  configure:

# To use mx4j, an HTML interface for JMX, add mx4j-tools.jar to the lib/
# directory.
# By default mx4j listens on 0.0.0.0:8081. Uncomment the following lines to
# control its listen address and port.
MX4J_ADDRESS="-Dmx4jaddress=0.0.0.0"
MX4J_PORT="-Dmx4jport=8081"

And then add this:

JVM_OPTS="$JVM_OPTS $MX4J_ADDRESS"
JVM_OPTS="$JVM_OPTS $MX4J_PORT"

somewhere near the end of the file where JVM_OPTS is appended.
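
After restarting the node, one way to confirm that something is listening on the mx4j port is a plain TCP connection test. This is a generic sketch using only the Python standard library; the host 127.0.0.1 and port 8081 are assumptions matching the defaults discussed above, and an open port only shows that a listener exists, not that it is mx4j.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed defaults from the discussion above; adjust to your MX4J_ADDRESS/MX4J_PORT.
print(port_is_open("127.0.0.1", 8081))
```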

Regards,
Michał

W dniu 12.11.2012 12:03, Francisco Trujillo Hacha pisze:
 I am trying to monitor a single Cassandra node using:
 
 http://wiki.apache.org/cassandra/Operations#Monitoring_with_MX4J
 
 But I can't find the variables indicated (url and port) in the
 cassandra-env.sh
 
 Where could they be?

Re: read request distribution

2012-11-12 Thread Wei Zhu

Thanks Tyler for the information. From the online document:

QUORUM Returns the record with the most recent timestamp after
a quorum of replicas has responded. 

From that description, it's hard to know that the digest query will be sent to only *one* other replica. When 
the node gets the request, does it become the coordinator, since all the nodes 
have all the data in this setting? Or will it send the query to the primary 
node (the node in charge of the token) and let that node be the coordinator? 
I would guess the latter is the case; otherwise I can't explain why the third 
node is always slower than the other two, given that it's in charge of 
wider columns than the other two.

Thanks.
-Wei



 From: Tyler Hobbs ty...@datastax.com
To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com 
Sent: Saturday, November 10, 2012 3:15 PM
Subject: Re: read request distribution
 

When you read at quorum, a normal read query will be sent to one replica 
(possibly the same node that's coordinating) and a digest query will be sent to 
*one* other replica, not both.  Which replicas get picked for these is 
determined by the dynamic snitch, which will favor replicas that are responding 
with the lowest latency.  That's why you'll see more queries going to replicas 
with lower latencies.

The Read Count number in nodetool cfstats is for local reads, not coordination 
of a read request.





-- 
Tyler Hobbs
DataStax

Re: read request distribution

2012-11-12 Thread Wei Zhu

That is actually my original question. All three nodes have the complete data, 
all of them have exactly the same hardware/software configuration, and the 
client uses round robin (RR) to distribute the read requests among the nodes. 
Why does one of them consistently report much larger latency than the other two?

Thanks for the explanation of QUORUM, it clears up a lot of confusion.

-Wei




 From: Tyler Hobbs ty...@datastax.com
To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com 
Sent: Monday, November 12, 2012 12:43 PM
Subject: Re: read request distribution
 

Whichever node gets the initial Thrift request from the client is always the 
coordinator; there's no concept of making another node the coordinator.

As far as QUORUM goes, only two nodes need to give a response to meet the 
consistency level, so Cassandra only sends out two read requests: one data 
request, and one digest request (unless the request is selected for read 
repair, in which case it sends a digest request to two other nodes instead of 
one).  If the coordinator happens to be a replica for the requested data, it 
will usually pick itself for the data request and send the digest query 
elsewhere.

Since all of your nodes hold all of the data, I'm not sure what you're 
referring to when you say that it's in charge of the wider columns.  Because 
of dynamic snitch behavior, you should expect to see the node with the highest 
latencies get the fewest queries, even if the high latency is partially 
*caused* by it not getting many queries (i.e. cold cache).
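
The read path described above can be sketched as a toy model. This is only an
illustration under stated assumptions, not Cassandra's code: the function names
are made up, the latencies are rounded from the cfstats output posted earlier
in the thread, the client is assumed to pick coordinators round robin, and the
dynamic snitch is reduced to "send the digest to the other replica with the
lowest fixed latency" (the real snitch keeps re-scoring replicas over time).

```python
from collections import Counter
from itertools import cycle


def quorum(replication_factor):
    """Number of replicas that must respond for a QUORUM read."""
    return replication_factor // 2 + 1


def simulate_reads(latency_ms, requests):
    """Count local reads per node when every node is a replica (RF=3, 3 nodes).

    Assumptions (illustrative only): coordinators are chosen round robin,
    the coordinator serves the data request itself, and the single digest
    request goes to the other replica with the lowest latency.
    """
    nodes = list(latency_ms)
    local_reads = Counter()
    coordinators = cycle(nodes)
    for _ in range(requests):
        coordinator = next(coordinators)
        others = [node for node in nodes if node != coordinator]
        digest_target = min(others, key=latency_ms.get)
        local_reads[coordinator] += 1     # data request
        local_reads[digest_target] += 1   # digest request
    return local_reads


latency_ms = {"node1": 72.5, "node2": 86.9, "node3": 168.1}
print(quorum(3))
print(simulate_reads(latency_ms, 3000))
```

In this toy model the slowest node only serves the reads it coordinates itself,
so it ends up with the fewest local reads. That matches the direction of the
skew in the reported Read Count numbers, although the observed ratio is less
extreme than the model's.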





Re: read request distribution

2012-11-09 Thread Wei Zhu

I think the rows whose keys fall into the token range of the high-latency 
node are likely to have more columns than the rows on the other nodes. I have 
three nodes with RF = 3, so all the nodes have all the data. And CL = Quorum, 
meaning, as I understand it, that each request is sent to all three nodes and 
the response is sent back to the client when two of them respond. What exactly 
does Read Count from nodetool cfstats mean then? Should it be the same across 
all the nodes? I checked Hector, and it uses a round-robin LB strategy. I also 
tested writes, and they are distributed evenly across the cluster. Below is the 
output from nodetool. Does anyone have a clue what might have happened?

Node 1:
Read Count: 318679
Read Latency: 72.47641436367003 ms.
Write Count: 158680
Write Latency: 0.07918750315099571 ms.
Node 2:
Read Count: 251079
Read Latency: 86.91948475579399 ms.
Write Count: 158450
Write Latency: 0.1744694540864626 ms.
Node 3:
Read Count: 149876
Read Latency: 168.14125553123915 ms.
Write Count: 157896
Write Latency: 0.06468631250949992 ms.

 nodetool ring
Address         DC          Rack        Status State   Load            Effective-Ownership Token
                                                                                           113427455640312821154458202477256070485
10.1.3.152      datacenter1 rack1       Up     Normal  35.85 GB        100.00%             0
10.1.3.153      datacenter1 rack1       Up     Normal  35.86 GB        100.00%             56713727820156410577229101238628035242
10.1.3.155      datacenter1 rack1       Up     Normal  35.85 GB        100.00%             113427455640312821154458202477256070485


Keyspace: benchmark:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
    Options: [replication_factor:3]

I am really confused by the Read Count number from nodetool cfstats

Really appreciate any hints.
-Wei
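
As a sanity check on the ring output above, the three tokens are what you get
by spacing nodes evenly over the token space. The sketch below assumes
RandomPartitioner (token space 0 to 2**127); the helper name `balanced_tokens`
is made up for illustration.

```python
RING_SIZE = 2 ** 127  # token space of RandomPartitioner (assumed here)


def balanced_tokens(node_count):
    """Evenly spaced initial tokens for node_count nodes, starting at 0."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


for token in balanced_tokens(3):
    print(token)
```

The values match the Token column of nodetool ring, so the ring itself is
balanced and token placement is unlikely to explain the uneven read counts.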




read request distribution

2012-11-08 Thread Wei Zhu

Hi All,
I am running a benchmark on Cassandra. I have a three-node cluster with RF=3. I 
generated 6M rows with sequence numbers from 1 to 6M, so the rows should be 
evenly distributed among the three nodes, disregarding the replicas. 

The benchmark is read-only: I generate read requests for randomly generated 
keys from 1 to 6M. Oddly, nodetool cfstats reports that one node has only half 
as many requests as another, and the third node sits in the middle, so the 
ratio is roughly 2:3:4. The node with the most read requests actually has the 
lowest latency, and the one with the fewest read requests reports the highest 
latency. The difference is pretty big: the highest latency is almost double the 
lowest.

All three nodes have exactly the same hardware, and the data size on each node 
is the same, since the RF is three and all of them have the complete data. I am 
using Hector as the client, and the random read requests number in the millions. 
I can't think of a reasonable explanation. Can someone please shed some light?

Thanks.
-Wei

Re: composite column validation_class question

2012-11-08 Thread Wei Zhu

Any thoughts?

Thanks.
-Wei






composite column validation_class question

2012-11-07 Thread Wei Zhu

Hi All,
I am trying to design my schema using composite columns. One thing I am a bit 
confused about is how to define validation_class for a composite column, or 
whether there is a way to define it at all.
With composite columns, I might insert a different type of value depending on 
the column name. For example, I will insert a date for the column created: 

set user[1]['7:1:100:created'] = 1351728000; 

and insert a string for the description:

set user[1]['7:1:100:desc'] = 'my description'; 

I don't see a way to define validation_class for a composite column. Am I right?

Thanks.
-Wei

Create CF with composite column through CQL 3

2012-10-31 Thread Wei Zhu

I tried to use CQL3 to create a CF with composite columns:

 CREATE TABLE Friends (
     user_id bigint,
     friend_id bigint,
     status int,
     source int,
     created timestamp,
     lastupdated timestamp,
     PRIMARY KEY (user_id, friend_id, status, source)
 );


When I check it with the CLI, the composite type looks a bit odd. Why is it 
defined as Long, Int32, Int32, UTF8? Isn't it supposed to be Long, Long, Int32, 
Int32? Did I do something wrong?

 describe friends;
    ColumnFamily: friends
      Key Validation Class: org.apache.cassandra.db.marshal.LongType
      Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
      Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(
          org.apache.cassandra.db.marshal.LongType,
          org.apache.cassandra.db.marshal.Int32Type,
          org.apache.cassandra.db.marshal.Int32Type,
          org.apache.cassandra.db.marshal.UTF8Type)
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 0.1
      DC Local Read repair chance: 0.0
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: default
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        sstable_compression: org.apache.cassandra.io.compress.SnappyCompressor

Thanks.
-Wei

Re: Create CF with composite column through CQL 3

2012-10-31 Thread Wei Zhu

Thanks Tristan and Sylvain, it all makes sense now. 

One follow-up question regarding the composite columns. It seems that in the 
WHERE clause I can only restrict the query on the first composite column 
(friend_id, in my case). I understand this is determined by the underlying row 
storage structure. 
Is there any plan to improve that, so that I can search on the other composite 
columns if I don't care about the performance?

cqlsh:demo> select * from friends where source = 7;
Bad Request: PRIMARY KEY part source cannot be restricted (preceding part 
status is either not restricted or by a non-EQ relation)


Thanks.
-Wei
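
The rule behind the error message above can be sketched as a small check: a
component of the column name may only be restricted if the component before it
is restricted by equality. This is a simplified illustration, not Cassandra's
validation code; the function name is hypothetical and only EQ restrictions are
modelled.

```python
def first_invalid_restriction(clustering_columns, eq_restricted):
    """Return (column, preceding_column) for the first restricted column
    whose preceding column is not EQ-restricted, or None if all are valid.
    """
    for index, column in enumerate(clustering_columns):
        if column not in eq_restricted:
            continue
        if index > 0 and clustering_columns[index - 1] not in eq_restricted:
            return column, clustering_columns[index - 1]
    return None


clustering = ["friend_id", "status", "source"]
print(first_invalid_restriction(clustering, {"source"}))
print(first_invalid_restriction(clustering, {"friend_id", "status"}))
```

Restricting only source is rejected because status is unrestricted, while
restricting friend_id and status together forms a valid prefix of the column
name.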



 From: Tristan Seligmann mithra...@mithrandi.net
To: user@cassandra.apache.org; Wei Zhu wz1...@yahoo.com 
Sent: Wednesday, October 31, 2012 10:47 AM
Subject: Re: Create CF with composite column through CQL 3
 
On Wed, Oct 31, 2012 at 7:14 PM, Wei Zhu wz1...@yahoo.com wrote:
 I try to use CQL3 to create CF with composite columns,

  CREATE TABLE Friends (
          ...     user_id bigint,
          ...     friend_id bigint,
          ...     status int,
          ...     source int,
          ...     created timestamp,
          ...     lastupdated timestamp,
          ...     PRIMARY KEY (user_id, friend_id, status, source)
          ... );


 When I check it with cli, the composite type is a bit odd, why it's defined
 as Long, Int32, Int32, UTF8, is it supposed to be Long, Long, Int32, Int32?

The first component of the PRIMARY KEY (user_id) is the row key:

       Key Validation Class: org.apache.cassandra.db.marshal.LongType

The rest of the components (friend_id, status, source) are part of the
column name:

       Columns sorted by: org.apache.cassandra.db.marshal.CompositeType(
 org.apache.cassandra.db.marshal.LongType,
 org.apache.cassandra.db.marshal.Int32Type,
 org.apache.cassandra.db.marshal.Int32Type,

and the final component of the column name is the CQL-level column name:

 org.apache.cassandra.db.marshal.UTF8Type)

In this case, it will be "created" or "lastupdated", as those are the
only columns not part of the PRIMARY KEY.
-- 
mithrandi, i Ainil en-Balandor, a faer Ambar
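
The mapping described in this reply can be sketched as follows. It is a
simplified model for illustration only: the function name and the sample
values are made up, and other details of the storage format are ignored.

```python
def to_storage_cells(row, partition_key, clustering_columns):
    """Map one CQL3 row to (row_key, {composite_column_name: value})."""
    row_key = row[partition_key]
    # The remaining PRIMARY KEY parts become the prefix of every column name.
    prefix = tuple(row[name] for name in clustering_columns)
    cells = {
        prefix + (name,): value
        for name, value in row.items()
        if name != partition_key and name not in clustering_columns
    }
    return row_key, cells


row = {
    "user_id": 1,
    "friend_id": 42,
    "status": 0,
    "source": 7,
    "created": 1351728000,
    "lastupdated": 1351728000,
}
row_key, cells = to_storage_cells(row, "user_id", ["friend_id", "status", "source"])
print(row_key)
for name, value in cells.items():
    print(name, value)
```

Each column name has the shape (Long, Int32, Int32, UTF8), which is the
CompositeType the CLI reports, and the row key is the Long user_id.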

Re: Benifits by adding nodes to the cluster

2012-10-31 Thread Wei Zhu

I have heard about virtual nodes, but they don't come out until 1.2. Is it easy 
to convert an existing installation to use virtual nodes?

Thanks.
-Wei 




 From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org 
Sent: Wednesday, October 31, 2012 2:23 PM
Subject: Re: Benifits by adding nodes to the cluster
 

 I have been told that it's much easier to scale the cluster by doubling the
 number of nodes, since no token changes are needed on the existing nodes.

Yup.

 But if the number of nodes is substantial, it's not realistic to double it
 every time.

See the keynote from Jonathan Ellis or the talk on Virtual Nodes from Sam here: 
http://www.datastax.com/events/cassandrasummit2012/presentations

Virtual nodes make this sort of thing faster and easier.

 How easy is it to add, let's say, 3 additional nodes to the existing
 10 nodes?

In that scenario you would need to move every node. 
But if you have 10 nodes you probably don't want to scale up by 3; I would 
guess 5 or 10. Scaling is not something you want to do every day. 

How easy the process is depends on the level of automation in your environment. 
For example, OpsCenter can automate rebalancing nodes. 

Cheers



-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com 
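
The difference between doubling and adding three nodes can be checked with a
little arithmetic. The sketch below assumes RandomPartitioner, one token per
node, and a ring that is kept perfectly balanced starting at token 0; the
function names are made up for illustration.

```python
RING_SIZE = 2 ** 127  # token space of RandomPartitioner (assumed here)


def balanced_tokens(node_count):
    """Evenly spaced tokens for node_count nodes, starting at 0."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def nodes_to_move(old_count, new_count):
    """Existing nodes whose token is not part of the new balanced ring."""
    new_tokens = set(balanced_tokens(new_count))
    return sum(1 for token in balanced_tokens(old_count) if token not in new_tokens)


print(nodes_to_move(10, 20))
print(nodes_to_move(10, 13))
```

Under these assumptions, going from 10 to 20 nodes leaves every existing token
in place, while going from 10 to 13 keeps only the node at token 0 where it is
and requires a move for the other nine.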

On 31/10/2012, at 7:14 AM, weiz wz1...@yahoo.com wrote:

One follow-up question.
I have been told that it's much easier to scale the cluster by doubling the
number of nodes, since no token changes are needed on the existing nodes.
But if the number of nodes is substantial, it's not realistic to double it
every time. How easy is it to add, let's say, 3 additional nodes to the
existing 10 nodes? I understand the process of moving data around and deleting
unused data. I just want to understand, from an operational point of view, how
difficult that is. We are in the process of evaluating NoSQL solutions, and
one important consideration is the operational cost. Any real-world
experience is very much appreciated.

Thanks.
-Wei 



