Re: CQL Composite Key Seen After Table Creation

2016-01-15 Thread Chris Burroughs

On 01/06/2016 04:47 PM, Robert Coli wrote:

On Wed, Jan 6, 2016 at 12:54 PM, Chris Burroughs <chris.burrou...@gmail.com>
wrote:

The problem with that approach is that manually editing the local schema
tables in a live cluster is wildly dangerous. I *think* this would work:


  * Make triple sure no schema changes are happening on the cluster.

  * Update schema tables on each node --> drain --> restart


I think that would work too, and probably be lower risk than modifying on
one and trying to get the others to pull via resetlocalschema. But I agree
it seems "wildly dangerous".


We did this, and a day later it appears successful.

I am still fuzzy on how schema "changes" propagate when you edit the 
schema tables directly and am unsure if the drain/restart rain dance was 
strictly necessary, but it felt safer. (Obviously even if I was sure 
now, that would not be behavior to count on, and I hope not to need to 
do this again.)




Re: CQL Composite Key Seen After Table Creation

2016-01-06 Thread Chris Burroughs
I work with Amir and after further experimentation I can shed a little more 
light on what exactly is going on under the hood.  For background, our goal is 
to take data that is currently being read and written via thrift, switch reads 
to CQL, and then switch writes to CQL.  This is an alternative to deleting all 
of our data and starting over, or being forever stuck on super old thrift 
clients (both of those options obviously suck.)  The data models involved are 
absurdly simple (a single key with a handful of static columns).

TLDR: Metadata is complicated.  What is the least dangerous way to make direct 
changes to system.schema_columnfamilies and system.schema_columns?

Anyway, given some super simple Foo and Bar column families:

create keyspace Test with  placement_strategy = 
'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
{replication_factor:1};
use Test;
create column family Foo with comparator = UTF8Type and 
key_validation_class=UTF8Type and column_metadata = [ {column_name: title, 
validation_class: UTF8Type}];
create column family Bar with comparator = UTF8Type and 
key_validation_class=UTF8Type;
update column family Bar with column_metadata = [ {column_name: title, 
validation_class: UTF8Type}];

(The salient difference as described by Amir is when the column_metadata is 
set; at the same time as creation or later.)

Now we can inject a little data and see that from thrift everything looks fine:

[default@Test] set Foo['testkey']['title']='mytitle';
Value inserted.
Elapsed time: 19 msec(s).
[default@Test] set Bar['testkey']['title']='mytitle';
Value inserted.
Elapsed time: 4.47 msec(s).

[default@Test] list Foo;
Using default limit of 100
Using default cell limit of 100
---
RowKey: testkey
=> (name=title, value=mytitle, timestamp=1452108082972000)

1 Row Returned.
Elapsed time: 268 msec(s).
[default@Test] list Bar;
Using default limit of 100
Using default cell limit of 100
---
RowKey: testkey
=> (name=title, value=mytitle, timestamp=1452108093739000)

1 Row Returned.
Elapsed time: 9.3 msec(s).

But from cql the Bar column does not look like the data we wrote:

cqlsh> select * from "Test"."Foo";

 key     | title
---------+---------
 testkey | mytitle

(1 rows)


cqlsh> select * from "Test"."Bar";

 key     | column1 | value            | title
---------+---------+------------------+---------
 testkey |   title | 0x6d797469746c65 | mytitle


It's not just that these phantom columns are ugly; cql thinks column1 is part 
of a composite primary key.  Since there **is no column1**, that renders the 
data un-query-able with WHERE clauses.

Just to make sure it's not thrift that is doing something unexpected, the 
sstables show the expected structure:

$ ./tools/bin/sstable2json 
/data/sstables/data/Test/Foo-d3348860b4af11e5b456639406f48f1b/Test-Foo-ka-1-Data.db
 
[
{"key": "testkey",
 "cells": [["title","mytitle",1452110466924000]]}
]



So, what appeared to be an innocent variation made years ago when the thrift 
schema was written causes very different results in cql.

Digging into the schema tables shows what is going on in more detail:

> select 
> keyspace_name,columnfamily_name,column_aliases,comparator,is_dense,key_aliases,value_alias
>  from system.schema_columnfamilies where keyspace_name='Test';

 keyspace_name | columnfamily_name | column_aliases | comparator                               | is_dense | key_aliases | value_alias
---------------+-------------------+----------------+------------------------------------------+----------+-------------+-------------
          Test |               Bar |    ["column1"] | org.apache.cassandra.db.marshal.UTF8Type |     True |     ["key"] |       value
          Test |               Foo |             [] | org.apache.cassandra.db.marshal.UTF8Type |    False |     ["key"] |        null

> select keyspace_name,columnfamily_name,column_name,validator from 
> system.schema_columns where keyspace_name='Test';

 keyspace_name | columnfamily_name | column_name | validator
---------------+-------------------+-------------+-------------------------------------------
          Test |               Bar |     column1 |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Bar |         key |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Bar |       title |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Bar |       value | org.apache.cassandra.db.marshal.BytesType
          Test |               Foo |         key |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Foo |       title |  org.apache.cassandra.db.marshal.UTF8Type


Now the interesting bit is that the metadata can be manually "fixed":

UPDATE 

Re: Migration 1.2.14 to 2.0.8 causes Tried to create duplicate hard link at startup

2014-06-10 Thread Chris Burroughs

Were you able to solve or work around this problem?

On 06/05/2014 11:47 AM, Tom van den Berge wrote:

Hi,

I'm trying to migrate a development cluster from 1.2.14 to 2.0.8. When
starting up 2.0.8, I'm seeing the following error in the logs:


  INFO 17:40:25,405 Snapshotting drillster, Account to
pre-sstablemetamigration
ERROR 17:40:25,407 Exception encountered during startup
java.lang.RuntimeException: Tried to create duplicate hard link to
/Users/tom/cassandra-data/data/drillster/Account/snapshots/pre-sstablemetamigration/drillster-Account-ic-65-Filter.db
 at
org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:75)
 at
org.apache.cassandra.db.compaction.LegacyLeveledManifest.snapshotWithoutCFS(LegacyLeveledManifest.java:129)
 at
org.apache.cassandra.db.compaction.LegacyLeveledManifest.migrateManifests(LegacyLeveledManifest.java:91)
 at
org.apache.cassandra.db.compaction.LeveledManifest.maybeMigrateManifests(LeveledManifest.java:617)
 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274)
 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)


Does anyone have an idea how to solve this?


Thanks,
Tom





Re: New node Unable to gossip with any seeds

2014-06-04 Thread Chris Burroughs
This generally means that the address you are using to describe the seed 
node doesn't match how it's described in the second node's seed list.


CASSANDRA-6523 has some links that might be helpful.

On 05/26/2014 12:07 AM, Tim Dunphy wrote:

Hello,

  I am trying to spin up a new node using cassandra 2.0.7. Both nodes are at
Digital Ocean. The seed node is up and running and I can telnet to port
7000 on that host from the node I'm trying to start.

[root@cassandra02 apache-cassandra-2.0.7]# telnet 10.10.1.94 7000

Trying 10.10.1.94...

Connected to 10.10.1.94.

Escape character is '^]'.

But when I start cassandra on the new node I see the following exception:


INFO 00:01:34,744 Handshaking version with /10.10.1.94

ERROR 00:02:05,733 Exception encountered during startup

java.lang.RuntimeException: Unable to gossip with any seeds

 at
org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)

 at
org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)

 at
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)

 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)

 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)

java.lang.RuntimeException: Unable to gossip with any seeds

 at
org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)

 at
org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)

 at
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)

 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)

 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)

Exception encountered during startup: Unable to gossip with any seeds

ERROR 00:02:05,742 Exception in thread
Thread[StorageServiceShutdownHook,5,main]

java.lang.NullPointerException

 at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1270)

 at
org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:573)

 at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

 at java.lang.Thread.run(Thread.java:745)



I'm using the murmur3 partition on both nodes and I have the seed node's IP
listed in the cassandra.yaml of the new node. I'm just wondering what the
issue might be and how I can get around it.


Thanks

Tim








Re: alternative vnode upgrade strategy?

2014-06-04 Thread Chris Burroughs

On 05/28/2014 02:18 PM, William Oberman wrote:

1.) Upgrade all N nodes to vnodes in place
Start loop
2.) Boot a new node and let it bootstrap
3.) Decommission an old node
End loop


It's been a while since I had to think about the vnode migration, but 
I think this would fall prey to 
https://issues.apache.org/jira/browse/CASSANDRA-5525


Re: Is the tarball for a given release in a Maven repository somewhere?

2014-05-22 Thread Chris Burroughs
Maven central has bin.tar.gz and src.tar.gz downloads for the 
'apache-cassandra' artifact.  Does that work for your use case?


http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22apache-cassandra%22

On 05/20/2014 05:30 PM, Clint Kelly wrote:

Hi all,

I am using the maven assembly plugin to build a project that contains
a development environment for a project that we've built at work on
top of Cassandra.  I'd like this development environment to include
the latest release of Cassandra.

Is there a maven repo anywhere that contains an artifact with the
Cassandra release in it?  I'd like to have the same Cassandra tarball
that you can download from the website be a dependency for my project.
  I can then have the assembly plugin untar it and customize some of
the conf files before tarring up our entire development environment.
That way, anyone using our development environment would have access
to the various shell scripts and tools.

I poked around online and could not find what I was looking for.  Any
help would be appreciated!

Best regards,
Clint





Re: What does the rate signify for latency in the JMX Metrics?

2014-05-16 Thread Chris Burroughs
They are exponential decaying moving averages (like Unix load averages) 
of the number of events per unit of time.


http://wiki.apache.org/cassandra/Metrics might help
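For intuition, here is a minimal Python sketch of such a decaying rate, in the style of the Codahale/Dropwizard metrics library Cassandra uses (the 5-second tick interval mirrors that library's convention, but the class and numbers here are illustrative, not Cassandra's actual code):

```python
import math

class EWMARate:
    """Exponentially weighted moving average of events per second,
    roughly how a metrics OneMinuteRate is computed."""
    TICK_INTERVAL = 5.0  # seconds between ticks

    def __init__(self, window_seconds):
        # older ticks contribute exponentially less; window sets the decay
        self.alpha = 1.0 - math.exp(-self.TICK_INTERVAL / window_seconds)
        self.uncounted = 0
        self.rate = None  # events per second

    def mark(self, n=1):
        self.uncounted += n

    def tick(self):
        instant = self.uncounted / self.TICK_INTERVAL
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant  # seed with the first observed rate
        else:
            self.rate += self.alpha * (instant - self.rate)

# a steady 600 events/sec held for 10 minutes of 5-second ticks
m = EWMARate(60.0)
for _ in range(120):
    m.mark(3000)  # 600 events/sec * 5 sec
    m.tick()
print(round(m.rate))  # 600
```

So an OneMinuteRate of ~676 with RateUnit = SECONDS means roughly 676 events (here, write requests) per second, smoothed over about the last minute.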

On 04/17/2014 06:06 PM, Redmumba wrote:

Good afternoon,

I'm attempting to integrate the metrics generated via JMX into our internal
framework; however, the information for several of the metrics includes a
One/Five/Fifteen-minute rate, with the RateUnit in SECONDS.  For
example:

$get -b

org.apache.cassandra.metrics:name=Latency,scope=Write,type=ClientRequest *
#mbean =
org.apache.cassandra.metrics:name=Latency,scope=Write,type=ClientRequest:
LatencyUnit = MICROSECONDS;

EventType = calls;

RateUnit = SECONDS;

MeanRate = 383.6944837362387;

FifteenMinuteRate = 868.8420188648543;

FiveMinuteRate = 817.5239450236011;

OneMinuteRate = 675.7673129014964;

Max = 498867.0;

Count = 31257426;

Min = 52.0;

50thPercentile = 926.0;

Mean = 1063.114029159023;

StdDev = 1638.1542477604232;

75thPercentile = 1064.75;

95thPercentile = 1304.55;

98thPercentile = 1504.39992;

99thPercentile = 2307.35104;

999thPercentile = 10491.8502;



What does the rate signify in this context?  For example, given the
OneMinuteRate of  675.7673129014964 and the unit of seconds--what is this
measuring?  Is this the rate of which metrics are submitted? i.e., there
were an average of (676 * 60 seconds) metrics submitted over the last
minute?

Thanks!





Re: Backup procedure

2014-05-16 Thread Chris Burroughs
It's also good to note that only the Data files are compressed already. 
 Depending on your data, the Index and other files may be a significant 
percentage of total on-disk data.


On 05/02/2014 01:14 PM, tommaso barbugli wrote:

In my tests compressing with lzop sstables (with cassandra compression
turned on) resulted in approx. 50% smaller files.
That's probably because the chunks of data compressed by lzop are way bigger
than the average size of writes performed on Cassandra (not sure how data
is compressed but I guess it is done per single cell so unless one stores)


2014-05-02 19:01 GMT+02:00 Robert Coli rc...@eventbrite.com:


On Fri, May 2, 2014 at 2:07 AM, tommaso barbugli tbarbu...@gmail.comwrote:


If you are thinking about using Amazon S3 storage I wrote a tool that
performs snapshots and backups on multiple nodes.
Backups are stored compressed on S3.
https://github.com/tbarbugli/cassandra_snapshotter



https://github.com/JeremyGrosser/tablesnap

SSTables in Cassandra are compressed by default, if you are re-compressing
them you may just be wasting CPU.. :)

=Rob








Re: row caching for frequently updated column

2014-05-14 Thread Chris Burroughs

You are close.

On 04/30/2014 12:41 AM, Jimmy Lin wrote:

thanks all for the pointers.

let' me see if I can put the sequences of event together 

1.2
people mis-understand/mis-use row cache, that cassandra cached the entire
row of data even if you are only looking for small subset of the row data.
e.g
select single_column from a_wide_row_table
will result in entire row cached even if you are only interested in one
single column of a row.



Yep!


2.0
and because of potential misuse of heap memory, Cassandra 2.0 remove heap
cache, and only support off-heap cache, which has a side effect that write
will invalidate the row cache(my original question)



"off-heap" is a common but misleading name for the 
SerializingCacheProvider.  It still stores several objects on heap per 
cached item and has to deserialize on read.
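To illustrate the deserialize-on-read behavior, here is a toy Python analogue (a concept sketch only, not Cassandra's implementation: pickle stands in for Cassandra's serializer, and a plain dict stands in for off-heap memory):

```python
import pickle

class SerializingCache:
    """Toy analogue of the SerializingCacheProvider: values are stored
    serialized, so every cache hit pays a deserialization cost."""
    def __init__(self):
        self._store = {}  # key -> serialized bytes

    def put(self, key, value):
        # serialize on write; only the blob is retained
        self._store[key] = pickle.dumps(value)

    def get(self, key):
        blob = self._store.get(key)
        # deserialize on every read -- this is the cost being described
        return None if blob is None else pickle.loads(blob)

c = SerializingCache()
c.put("row1", {"title": "mytitle"})
print(c.get("row1"))  # {'title': 'mytitle'}
```

The on-heap ConcurrentLinkedHashCacheProvider, by contrast, would hand back the cached object directly, trading GC pressure for cheaper reads.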



2.1
the coming 2.1 Cassandra will offer true cache by query, so the cached data
will be much more efficient even for wide rows(it cached what it needs).

do I get it right?
for the new 2.1 row caching, is it still true that a write or update to the
row will still invalidate the cached row ?



I don't think "true cache by query" is an accurate description of 
CASSANDRA-5357.  I think it's more like a "head of the row" cache.




Re: Thrift Server Implementations

2014-03-05 Thread Chris Burroughs

On 02/13/2014 01:37 PM, Christopher Wirt wrote:

Anyway, today I moved the old HsHa implementation and the new
TThreadSelectorServer into a 2.0.5 checkout, hooked them in, built, did a
bit of testing and I'm now running live.



We found the TThreadSelectorServer performed the best getting us back under
our SLA.


Are you still running with the upstream TThreadSelectorServer?  Based on 
your experience, is there any reason Cassandra should not adopt it?


Re: mixed nodes, some SSD some HD

2014-03-05 Thread Chris Burroughs
No.  If you have a heterogeneous cluster you should consider adjusting 
the number of vnodes per physical node.
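As a back-of-envelope sketch of that adjustment (illustrative only: the node names and capacity ratios are made up, and 256 is used as the common default num_tokens of the era):

```python
def proportional_num_tokens(capacities, baseline_tokens=256):
    """Give each node a num_tokens proportional to its relative capacity,
    so the faster hardware owns proportionally more of the ring."""
    base = max(capacities.values())
    return {node: max(1, round(baseline_tokens * cap / base))
            for node, cap in capacities.items()}

# e.g. an SSD node that can serve roughly 2x the spinning-disk nodes
print(proportional_num_tokens({"hd1": 1.0, "hd2": 1.0, "ssd1": 2.0}))
# {'hd1': 128, 'hd2': 128, 'ssd1': 256}
```

Each node would then set its own num_tokens in cassandra.yaml before joining; this only applies to a vnode-enabled cluster.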


On 03/04/2014 10:47 PM, Elliot Finley wrote:

Using Cassandra 2.0.x

If I have a 3 node cluster and 2 of the nodes use spinning drives and 1 of
them uses SSD,  will the majority of the reads be routed to the SSD node
automatically because it has faster responses?

TIA,
Elliot





Re: ring describe returns only public ips

2014-02-10 Thread Chris Burroughs
More generally, a thrift api or other mechanism for Astyanax to get the 
INTERNAL_IP seems necessary to use ConnectionPoolType.TOKEN_AWARE + 
NodeDiscoveryType.TOKEN_AWARE in a multi-dc setup.  Absent one I'm 
confused how that combination is possible.


On 02/06/2014 03:17 PM, Ted Pearson wrote:

We are using Cassandra 1.2.13 in a multi-datacenter setup. We are using 
Astyanax as the client, and we’d like to enable its token aware connection pool 
type and ring describe node discovery type. Unfortunately, I’ve found that both 
thrift’s describe_ring and `nodetool ring` only report the public IPs of the 
cassandra nodes. This means that Astyanax tries to reconnect to the public IPs 
of each node, which doesn’t work and just results in no hosts being available 
for queries according to Astyanax.

I know from `nodetool gossipinfo` (and the fact that the clusters work) that 
it's sharing the LOCAL_IP via gossip, but have no idea how or if it’s possible 
to get describe_ring to return local IPs, or if there is some alternative.

Thanks,

-Ted





Re: Question about local reads with multiple data centers

2014-02-06 Thread Chris Burroughs

On 01/29/2014 08:07 PM, Donald Smith wrote:

My question: will the read process try to read first locally from the 
datacenter DC2 I specified in its connection string? I presume so.  (I 
doubt that it uses the client's IP address to decide which datacenter is 
closer. And I am unaware of another way to tell it to read locally.)



From the rest if this thread it looks like you were asking about how 
the client selected a Cassandra node to act as a coordinator.  Note 
however that if you are using a DC oblivious CL (ONE, QUORUM) then that 
Cassandra coordinator may send requests to the remote data center.




Also, will read repair happen between datacenters automatically 
(read_repair_chance=0.10)?  Or does that only happen within a single data 
center?


Yes, read_repair_chance is global.  There is a separate dc-local read 
repair chance (dclocal_read_repair_chance) if you want to make local 
read repairs more common.


Re: what tool will create noncql columnfamilies in cassandra 3a

2014-02-06 Thread Chris Burroughs

On 02/05/2014 04:57 AM, Sylvain Lebresne wrote:

How will users adjust the meta data of non cql column families

The rational for removing cassandra-cli is mainly that maintaining 2 fully
featured command line interface is a waste of the project resources in the
long
run. It's just a tool using the thrift interface however and you'll still be
able to adjust metadata through the thrift interface as before. As Patricia
mentioned, there is even some existing interactive options like pycassaShell
in the community.


It's also wasteful for the community to maintain multiple post-3.0 forks 
of cassandra-cli so they can continue using Cassandra.  It would be 
more efficient if they could pool their resources in a central place, 
like a code repo at Apache.


Re: First SSTable file is not being compacted

2014-02-06 Thread Chris Burroughs

On 02/06/2014 01:17 AM, Sameer Farooqui wrote:

I'm running C* 2.0.4 and when I have a handful of SSTable files and trigger
a manual compaction with 'nodetool compact' the first SSTable file doesn't
get compacted away.

Is there something special about the first SSTable that it remains even
after a SizeTieredCompaction?



No, this is not expected behavior.  Do the number of live SSTables 
reported match what is on disk?  Do you have a procedure that can repeat 
this?




Re: First SSTable file is not being compacted

2014-02-06 Thread Chris Burroughs

Sounds like you have done some solid test work.

I suggest reading https://issues.apache.org/jira/browse/CASSANDRA-6568 
and if you think your issue is the same adding your reproduction case 
there, otherwise create your own ticket.


On 02/06/2014 10:53 AM, Sameer Farooqui wrote:

Yeah, it's definitely repeatable. I have a lab environment set up where the
issue is occurring and I've recreated the lab environment 4 - 5 times and
it's occurred each time.

In my demodb.users CF I currently have 2 data SSTables on disk
(demodb-users-jb-1-Data.db and demodb-users-jb-6-Data.db). However, in
OpsCenter the CF: SSTable Count (demodb.users) graph shows only one SSTable.

The nodetool cfstats command also shows SSTable count: 1 for this CF.


- SF


On Thu, Feb 6, 2014 at 8:54 AM, Chris Burroughs
chris.burrou...@gmail.comwrote:


On 02/06/2014 01:17 AM, Sameer Farooqui wrote:


I'm running C* 2.0.4 and when I have a handful of SSTable files and
trigger
a manual compaction with 'nodetool compact' the first SSTable file doesn't
get compacted away.

Is there something special about the first SSTable that it remains even
after a SizeTieredCompaction?




No, this is not expected behavior.  Do the number of live SSTables
reported match what is on disk?  Do you have a procedure that can repeat
this?








Re: Row cache vs. OS buffer cache

2014-01-23 Thread Chris Burroughs

My experience has been that the row cache is much more effective.
However, reasonable row cache sizes are so small relative to RAM that I 
don't see it as a significant trade-off unless it's in a very memory 
constrained environment.  If you want to enable the row cache (a big if) 
you probably want it to be as big as it can be until you have reached 
the point of diminishing returns on the hit rate.


The off-heap cache still has many on-heap objects so it doesn't 
really change that much conceptually; you will just end up with a 
different number for the size.


On 01/23/2014 02:13 AM, Katriel Traum wrote:

Hello list,

I was if anyone has any pointers or some advise regarding using row cache
vs leaving it up to the OS buffer cache.

I run cassandra 1.1 and 1.2 with JNA, so off-heap row cache is an option.

Any input appreciated.
Katriel





nodetool cleanup / TTL

2014-01-07 Thread Chris Burroughs
This has not reached consensus in #cassandra in the past.  Does 
`nodetool cleanup` also remove data that has expired from a TTL?


Re: nodetool cleanup / TTL

2014-01-07 Thread Chris Burroughs

On 01/07/2014 01:38 PM, Tyler Hobbs wrote:

On Tue, Jan 7, 2014 at 7:49 AM, Chris Burroughs
chris.burrou...@gmail.comwrote:


This has not reached a consensus in #cassandra in the past.  Does
`nodetool cleanup` also remove data that has expired from a TTL?



No, cleanup only removes rows that the node is not a replica for.



Is there some other mechanism for forcing expired data to be removed 
without also compacting? (major compaction having obvious problematic 
side effects, and user defined compaction being significant work to 
script up).




Re: vnode in production

2014-01-06 Thread Chris Burroughs

On 01/02/2014 01:51 PM, Arindam Barua wrote:

1.   the stability of vnodes in production


I'm happily using vnodes in production now, but I would have trouble 
calling them stable for more than small clusters until very recently 
(1.2.13). CASSANDRA-6127 served as a master ticket for most of the 
issues if you are interested in the details.



2.   upgrading to vnodes in production


I am not aware of anyone who has succeeded with shuffle in production, 
but the 'add a new DC' procedure works.


Re: vnode in production

2014-01-06 Thread Chris Burroughs

On 01/06/2014 01:56 PM, Arindam Barua wrote:

Thanks for your responses. We are on 1.2.12 currently.
The fixes in 1.2.13 seem to help for clusters in the 500+ node range (like 
CASSANDRA-6409). Ours is below 50 now, so we plan to go ahead and enable vnodes 
with the 'add a new DC' procedure. We will try to upgrade to 1.2.13 or 1.2.14 
subsequently.


Your plan seems reasonable but in the interest of full disclosure 
CASSANDRA-6345 has been observed as a significant issue for clusters in 
the 50-75 node range.


Re: How to measure data transfer between data centers?

2013-12-04 Thread Chris Burroughs
https://wiki.apache.org/cassandra/Metrics has per node Streaming metrics 
that include total bytes/in out.  That is only a small bit of what you 
want though.


For total DC bandwidth it might be more straightforward to measure this 
at the router/switch/fancy-network-gear level.


On 12/03/2013 06:25 AM, Tom van den Berge wrote:

Is there a way to know how much data is transferred between two nodes, or
more specifically, between two data centers?

I'm especially interested in how much data is being replicated from one
data center to another, to know how much of the available bandwidth is used.


Thanks,
Tom





MiscStage Backup

2013-11-26 Thread Chris Burroughs
I'm trying to debug a node that has a backup in MiscStage.  Starting a 
bit under 24 hours ago the number of Pending tasks jumped to a bit under 
400 and hovered around there.  It looks like repair requests from other 
nodes  (tpstats on this node shows AntiEntropySessions: 0, 0, 0, which I 
think indicates it did not originate the repair).  After each MiscStage 
task completes a series of Streams are kicked off.


I am confused why MiscStage is backing up:
 (A) This node has only been down a few hours over the past week so it 
should not be wildly out of sync
 (B) no other node in this cluster has had a comparable backup of 
pending Misc stages.


Repairs are run on all nodes once a week.  Physical resources on this 
node are not particularity saturated compared to the rest of the 
cluster; reads are slower but I can't tell cause from effect in that case.


Graph of MiscStage pending tasks: http://imgur.com/sHqHTvt

This is with a 1.2.11-ish dual-DC vnode cluster.
MiscStage:1 daemon prio=10 tid=0x7f84e8598800 nid=0x43b2 waiting on 
condition [0x7f83c3734000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  0x00069d23c700 (a 
java.util.concurrent.FutureTask$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:375)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:368)
at 
org.apache.cassandra.streaming.StreamOut.flushSSTables(StreamOut.java:108)
at 
org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:136)
at 
org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:116)
at 
org.apache.cassandra.streaming.StreamRequestVerbHandler.doVerb(StreamRequestVerbHandler.java:44)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



Re: Endless loop LCS compaction

2013-11-08 Thread Chris Burroughs

On 11/07/2013 06:48 AM, Desimpel, Ignace wrote:

Total data size is only 3.5GB. Column family was created with SSTableSize : 10 
MB


You may want to try a significantly larger size.

https://issues.apache.org/jira/browse/CASSANDRA-5727


Re: Why truncate previous hints when upgrade from 1.1.9 to 1.2.6?

2013-11-08 Thread Chris Burroughs

NEWS.txt has some details and suggested procedures

- The hints schema was changed from 1.1 to 1.2. Cassandra automatically
  snapshots and then truncates the hints column family as part of
  starting up 1.2 for the first time.  Additionally, upgraded nodes
  will not store new hints destined for older (pre-1.2) nodes. It is
  therefore recommended that you perform a cluster upgrade when all
  nodes are up. Because hints will be lost, a cluster-wide repair (with
  -pr) is recommended after upgrade of all nodes.

On 11/07/2013 07:33 AM, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 wrote:

Hi all,
When I upgrade C* from 1.1.9 to 1.2.6, I notice that the previous 
hintscolumnfamily would be directly truncated.
Can you tell me why ?
Because consistency is important to my services.


Best Regards,
Boole Guo





Re: Cassandra 1.1.6 - New node bootstrap not completing

2013-11-08 Thread Chris Burroughs

On 11/01/2013 03:03 PM, Robert Coli wrote:

On Fri, Nov 1, 2013 at 9:36 AM, Narendra Sharma
narendra.sha...@gmail.comwrote:


I was successfully able to bootstrap the node. The issue was RF  2.
Thanks again Robert.



For the record, I'm not entirely clear why bootstrapping two nodes into the
same range should have caused your specific bootstrap problem, but I am
glad to hear that bootstrapping one node at a time was a usable workaround.

=Rob



(A) If it can't work, shouldn't a node refuse to bootstrap if it sees 
another node already in that state?


(B) It would be nice if nodes in independent DCs could at least be 
bootstrapped at the same time.


Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-07 Thread Chris Burroughs

On 11/06/2013 11:18 PM, Aaron Morton wrote:

The default row cache is off the JVM heap, have you changed to the 
ConcurrentLinkedHashCacheProvider ?


ConcurrentLinkedHashCacheProvider was removed in 2.0.x.


Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-06 Thread Chris Burroughs
Both caches involve several objects per entry (What do we want?  Packed 
objects.  When do we want them? Now!).  The size is an estimate of the 
off heap values only and not the total size nor number of entries.


An acceptable size will depend on your data and access patterns.  In one 
case we had a cluster that at 512mb would go into a GC death spiral 
despite plenty of free heap (presumably just due to the number of 
objects) while empirically the cluster runs smoothly at 384mb.


Your caches appear on the larger size, I suggest trying smaller values 
and only increase when it produces measurable sustained gains.


On 11/05/2013 04:04 AM, Jiri Horky wrote:

Hi there,

we are seeing extensive memory allocation leading to quite long and
frequent GC pauses when using row cache. This is on cassandra 2.0.0
cluster with JNA 4.0 library with following settings:

key_cache_size_in_mb: 300
key_cache_save_period: 14400
row_cache_size_in_mb: 1024
row_cache_save_period: 14400
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
commitlog_segment_size_in_mb: 32

-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G
-Xmn1024M -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data2/cassandra-work/instance-1/cassandra-1383566283-pid1893.hprof
-Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark

We have disabled row cache on one node to see  the  difference. Please
see attached plots from visual VM, I think that the effect is quite
visible. I have also taken 10x jmap -histo after 5s on a affected
server and plotted the result, attached as well.

I have taken a dump of the application when the heap size was 10GB, most
of the memory was unreachable, which was expected. The majority was used
by 55-59M objects of HeapByteBuffer, byte[] and
org.apache.cassandra.db.Column classes. I also include a list of inbound
references to the HeapByteBuffer objects from which it should be visible
where they are being allocated. This was acquired using Eclipse MAT.

Here is the comparison of GC times when row cache enabled and disabled:

prg01 - row cache enabled
   - uptime 20h45m
   - ConcurrentMarkSweep - 11494686ms
   - ParNew - 14690885 ms
   - time spent in GC: 35%
prg02 - row cache disabled
   - uptime 23h45m
   - ConcurrentMarkSweep - 251ms
   - ParNew - 230791 ms
   - time spent in GC: 0.27%

I would be grateful for any hints. Please let me know if you need any
further information. For now, we are going to disable the row cache.

Regards
Jiri Horky





Re: The performance difference of online bulk insertion and the file-based bulk loading

2013-10-23 Thread Chris Burroughs

On 10/15/2013 08:41 AM, José Elias Queiroga da Costa Araújo wrote:

- is that is there a way that we can warm-up the cache, after the
file-based bulk loading, so that we can allow the data to be cached first
in the memory, and then afterwards, when we issue the bulk retrieval, the
performance can be closer to what is provided by the online-bulk-insertion.


Somewhat hacky, but you can at least warm up the OS page cache with `cat 
FILES > /dev/null`.
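A minimal sketch of that warm-up (the `*-Data.db` pattern and the example path are assumptions; point it at your own data_file_directories):

```shell
# Read every SSTable data file once so the kernel pulls its blocks into
# the page cache.  The '*-Data.db' naming is an assumption for illustration.
warm_page_cache() {
  find "$1" -name '*-Data.db' -print0 | xargs -0 -r cat > /dev/null
}

# e.g.:  warm_page_cache /var/lib/cassandra/data
```

This only warms the OS page cache, not Cassandra's own key/row caches.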


Re: nodetool status reporting dead node as UN

2013-10-23 Thread Chris Burroughs
When debugging gossip-related problems (is this node really 
down/dead/in-some-weird-state?) you might have better luck looking at 
`nodetool gossipinfo`.  The "UN even though everything is bad" thing 
might be https://issues.apache.org/jira/browse/CASSANDRA-5913


I'm not sure exactly what happened in your case.  I'm also confused 
about why an IP changed on restart.


On 10/17/2013 06:12 PM, Philip Persad wrote:

Hello,

I seem to have gotten my cluster into a bit of a strange state.
Pardon the rather verbose email, but there is a fair amount of
background.  I'm running a 3 node Cassandra 2.0.1 cluster.  This
particular cluster is used only rather intermittently for dev/testing
and does not see particularly heavy use, it's mostly a catch-all
cluster for environments which don't have a dedicated cluster to
themselves.  I noticed today that one of the nodes had died because
nodetool repair was failing due to a down replica.  I run nodetool
status and sure enough, one of my nodes shows up as down.

When I looked on the actual box, the cassandra process was up and
running and everything in the logs looked sensible.  The most
controversial thing I saw was 1 CMS Garbage Collection per hour, each
taking ~250 ms.  None the less, the node was not responding, so I
restarted it.  So far so good, everything is starting up, my ~30
column families across ~6 key spaces are all initializing.  The node
then handshakes with my other two nodes and reports them both as up.
Here is where things get strange.  According to the logs on the other
two nodes, the third node has come back up and all is well.  However
in the third node, I see a wall of the following in the logs (IP
addresses masked):

  INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:25,655
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:26,657
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:27,660
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [HANDSHAKE-/10.21.5.221] 2013-10-17 20:22:28,254
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.221
  INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
  INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:28,661
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN

Additional, client requests to the cluster at consistency QUORUM start
failing (saying 2 responses were required but only 1 replica
responded).  According to nodetool status, all the nodes are up.

This is clearly not good.  I take down the problem node.  Nodetool
reports it down and QUORUM client reads/writes start working again.
In an attempt to get the cluster back into a good state, I delete all
the data on the problem node and then bring it back up.  The other two
nodes log a changed host ID for the IP of the node I wiped and then
handshake with it.  The problem node also comes up, but reads/writes
start failing again with the same error.

I decide to take the problem node down again.  However this time, even
after the process is dead, nodetool and the other two nodes report
that my third node is still up and requests to the cluster continue to
fail.  Running nodetool status against either of the live nodes shows
that all nodes are up.  Running nodetool status against 

Re: Huge multi-data center latencies

2013-10-23 Thread Chris Burroughs

On 10/21/2013 07:03 PM, Hobin Yoon wrote:

Another question is how do you get the local DC name?



Have a look at org.apache.cassandra.db.EndpointSnitchInfo.getDatacenter



Re: How to use Cassandra on-node storage engine only?

2013-10-23 Thread Chris Burroughs
As far as I know this had not been done before.  I would be interested 
in hearing how it turned out.


On 10/23/2013 09:47 AM, Yasin Celik wrote:

I am developing an application for data storage. All the replication,
routing and data retrieving types of business are handled in my
application. Up to now, the data is stored in memory. Now, I want to use
Cassandra storage engine to flush data from memory into hard drive. I am
not sure if that is a correct approach.

My question: Can I use the Cassandra data storage engine only? I do not
want to use Cassandra as a whole standalone product (In this case, I should
run one independent Cassandra per node and my application act as if it is
client of Cassandra. This idea will put a lot of burden on node since it
puts unnecessary levels between my application and storage engine).

I have my own replication, ring and routing code. I only need the on-node
storage facilities of Cassandra. I want to embed cassandra in my
application as a library.





vnode + multi dc migration

2013-10-11 Thread Chris Burroughs
I know there is a good deal of interest [1] on feasible methods for 
enabling vnodes on clusters that did not start with them.


We recently completed a migration from a production cluster not using 
vnodes and in a single DC to one using vnodes in two DCs.  We used the 
"just spin up a new DC and rebuild" strategy instead of shuffle, and it 
worked.  The checklist was long, but it really wasn't more complicated 
than that.  Thanks to several people in #cassandra for suggesting the 
technique and reviewing procedures.


One oddity we noticed is that when nodes in a new DC join 
(auto_bootstrap:false) CL.ONE performance tanked [2].  The spike is when 
the nodes came online, and the drop is when reads were switched to 
CL.LOCAL_QUORUM.  This only happened when the new DC was cross-continent 
(not a logical DC in the same colo).


[1] 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201308.mbox/%3CCAEDUwd12vhRJbPZpVJ6QzTOx3pwU=11hhgkkipghhgvosbj...@mail.gmail.com%3E


[2] http://i.imgur.com/ZW5Ob8V.png


Re: Multi-dc restart impact

2013-10-10 Thread Chris Burroughs

Thanks, double checked; reads are CL.ONE.

On 10/10/2013 11:15 AM, J. Ryan Earl wrote:

Are you doing QUORUM reads instead of LOCAL_QUORUM reads?


On Wed, Oct 9, 2013 at 7:41 PM, Chris Burroughs
chris.burrou...@gmail.com wrote:


I have not been able to do the test with the 2nd cluster, but have been
given a disturbing data point.  We had a disk slowly fail causing a
significant performance degradation that was only resolved when the sick
node was killed.
  * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1
  * Perf in other DC: http://i.imgur.com/gEMrLyF.png?1

Not only was a single slow node able to cause an order-of-magnitude
performance hit in a DC, but the other DC fared *worse*.


On 09/18/2013 08:50 AM, Chris Burroughs wrote:


On 09/17/2013 04:44 PM, Robert Coli wrote:


On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:

  We have a 2 DC cluster running cassandra 1.2.9.  They are in actual

physically separate DCs on opposite coasts of the US, not just logical
ones.  The primary use of this cluster is CL.ONE reads out of a single
column family.  My expectation was that in such a scenario restarts
would
have minimal impact in the DC where the restart occurred, and no
impact in
the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs A and B).



Did you end up filing a JIRA on this, or some other outcome?

=Rob




No.  I am currently in the process of taking a 2nd cluster from being
single to dual DC.  Once that is done I was going to repeat the test
with each cluster and gather as much information as reasonable.










Re: Multi-dc restart impact

2013-10-09 Thread Chris Burroughs
I have not been able to do the test with the 2nd cluster, but have been 
given a disturbing data point.  We had a disk slowly fail causing a 
significant performance degradation that was only resolved when the 
sick node was killed.

 * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1
 * perf in other DC: http://i.imgur.com/gEMrLyF.png?1

Not only was a single slow node able to cause an order-of-magnitude 
performance hit in a DC, but the other DC fared *worse*.



On 09/18/2013 08:50 AM, Chris Burroughs wrote:

On 09/17/2013 04:44 PM, Robert Coli wrote:

On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:


We have a 2 DC cluster running cassandra 1.2.9.  They are in actual
physically separate DCs on opposite coasts of the US, not just logical
ones.  The primary use of this cluster is CL.ONE reads out of a single
column family.  My expectation was that in such a scenario restarts
would
have minimal impact in the DC where the restart occurred, and no
impact in
the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs A and B).



Did you end up filing a JIRA on this, or some other outcome?

=Rob




No.  I am currently in the process of taking a 2nd cluster from being
single to dual DC.  Once that is done I was going to repeat the test
with each cluster and gather as much information as reasonable.




gossip settling and bootstrap problems

2013-10-07 Thread Chris Burroughs
I've been running into a variety of tricky-to-diagnose problems recently 
that could be summarized as "bootstrap-related tasks fail without 
extra hacky sleep time."


This is a sample edited log file for bootstrapping a node that captures 
the general dynamics: http://pastebin.com/yeN9USLt  This build has been 
modified (from 1.2.10) to sleep 4*RING_DELAY in 
StorageService.bootstrap().  A few notes:

 * At 30s nodes are still flapping UP and DOWN
 * handshaking is still going strong at 90s
 * Things do stabilize; they don't flap indefinitely
 * Bootstrap succeeds once it starts.  In this particular cluster a 
default RING_DELAY/build (30s) fails every time.


Ping times, TCP retransmit, and other general network stuff look fine. 
There are several different tickets (some from me) that reference what 
seemed to me to be possibly similar or at least correlated issues:
 * CASSANDRA-4288 : prevent thrift server from starting before gossip 
has settled

 * CASSANDRA-5815 : NPE from migration manager
 * CASSANDRA-5915 : node flapping prevents replace_node from succeeding 
consistently
 * CASSANDRA-6156 : Poor resilience and recovery for bootstrapping node 
- unable to fetch range

 * CASSANDRA-6127 : vnodes don't scale to hundreds of nodes

I suspect that a combination of factors is causing gossip to take longer 
to stabilize:

 * vnodes
 * (cross country or greater) multi-dc
 * bigger than a test cluster (> 50 nodes)
 * reconnecting snitch

What are other people seeing in their clusters?  Does anyone routinely 
change RING_DELAY (Google finds precious few references)?
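For reference, RING_DELAY can be raised without a patched build via a system property (a config sketch; the 120s value is only an example):

```shell
# In cassandra-env.sh: raise RING_DELAY (default 30000 ms) so bootstrap
# waits longer for gossip to settle.  120000 is an illustrative value.
JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=120000"
```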


Re: Nodes separating from the ring

2013-09-23 Thread Chris Burroughs
I have observed one problem with an inconsistent ring that is 
superficially similar (node thinks it's up but peers disagree) and noted 
details in CASSANDRA-6082.  However, it does not sound like the details 
of either the symptoms, or the resolution match what you describe.


If you have not already, running `nodetool gossipinfo` might give you 
more clues than `status`.


On 09/13/2013 10:48 AM, Dave Cowen wrote:

Hi, all -

We've been running Cassandra 1.1.12 in production since February, and have
experienced a vexing problem with an arbitrary node falling out of or
separating from the ring on occasion.

When a node falls out of the ring, running nodetool ring on the
misbehaving node shows that the misbehaving node believes that it is Up, but
that the rest of the ring is Down, and the rest of the ring has question
marks listed for load. nodetool ring on any of the other nodes, however,
shows the misbehaving node as Down but everything else is up.

Shutting down and restarting the misbehaving node does not result in
changed behavior. We can only get the misbehaving node to rejoin the ring
by shutting it down, running nodetool removetoken <misbehaving node token>
and nodetool removetoken force elsewhere in the ring. After the node's
token has been removed from the ring, it will rejoin and behave normally
when it is restarted.

This is not a frequent occurrence - we can go months between this
happening. It most commonly occurs when a different node is brought down
and then back up, but it can happen spontaneously. This is also not
associated with a network connectivity event; we've seen no interruption in
the nodes being able to communicate over the network. As above, it's also
not isolated to a single node; we've seen this behavior on multiple nodes.

This has occurred with both the identical seeds specified in cassandra.yaml
on each node, and also when we remove the node from its own seed list (so
any seed won't try to auto-bootstrap from itself). Seeds have always been
up and available.

Has anyone else seen similar behavior? For obvious reasons, we hate seeing
one of the nodes suddenly fall out and require intervention when we flap
another node, or for no reason at all.

Thanks,

Dave





Re: Multi-dc restart impact

2013-09-18 Thread Chris Burroughs

On 09/17/2013 04:44 PM, Robert Coli wrote:

On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:


We have a 2 DC cluster running cassandra 1.2.9.  They are in actual
physically separate DCs on opposite coasts of the US, not just logical
ones.  The primary use of this cluster is CL.ONE reads out of a single
column family.  My expectation was that in such a scenario restarts would
have minimal impact in the DC where the restart occurred, and no impact in
the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs A and B).



Did you end up filing a JIRA on this, or some other outcome?

=Rob




No.  I am currently in the process of taking a 2nd cluster from being 
single to dual DC.  Once that is done I was going to repeat the test 
with each cluster and gather as much information as reasonable.


Re: I don't understand shuffle progress

2013-09-18 Thread Chris Burroughs

On 09/17/2013 09:41 PM, Paulo Motta wrote:

So you're saying the only feasible way of enabling VNodes on an upgraded C*
1.2 is by doing fork writes to a brand new cluster + bulk load of sstables
from the old cluster? Or is it possible to succeed on shuffling, even if
that means waiting some weeks for the shuffle to complete?


In a multi DC cluster situation you *should* be able to bring up a new 
DC with vnodes, bootstrap it, and then decommission the old cluster.


Re: I don't understand shuffle progress

2013-09-18 Thread Chris Burroughs

http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_add_dc_to_cluster_t.html

This is a basic outline.


On 09/18/2013 10:32 AM, Juan Manuel Formoso wrote:

I really like this idea. I can create a new cluster and have it replicate
the old one, after it finishes I can remove the original.

Any good resource that explains how to add a new datacenter to a live
single dc cluster that anybody can recommend?


On Wed, Sep 18, 2013 at 9:58 AM, Chris Burroughs
chris.burrou...@gmail.com wrote:


On 09/17/2013 09:41 PM, Paulo Motta wrote:


So you're saying the only feasible way of enabling VNodes on an upgraded
C*
1.2 is by doing fork writes to a brand new cluster + bulk load of sstables
from the old cluster? Or is it possible to succeed on shuffling, even if
that means waiting some weeks for the shuffle to complete?



In a multi DC cluster situation you *should* be able to bring up a new
DC with vnodes, bootstrap it, and then decommission the old cluster.









Multi-dc restart impact

2013-09-05 Thread Chris Burroughs
We have a 2 DC cluster running cassandra 1.2.9.  They are in actual 
physically separate DCs on opposite coasts of the US, not just logical 
ones.  The primary use of this cluster is CL.ONE reads out of a single 
column family.  My expectation was that in such a scenario restarts 
would have minimal impact in the DC where the restart occurred, and no 
impact in the remote DC.


We are seeing instead that restarts in one DC have a dramatic impact on 
performance in the other (let's call them DCs A and B).


Test scenario on a node in DC A:
 * disablegossip: no change
 * drain: no change
 * stop node: no change
 * start node again: Large increase in latency in both DCs A *and* B

This is a graph showing the increase in latency 
(org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile) 
from DC *B* http://i.imgur.com/OkIQyXI.png  (Actual clients report 
similar numbers that agree with this server side measurement).  Latency 
jumps by over an order of magnitude and out of SLAs.  (I would prefer 
restarting to not cause a latency spike in either DC, but the one 
induced in the remote DC is particularly concerning.)


However, the node that was restarted reports only a minor increase in 
latency http://i.imgur.com/KnGEJrE.png  This is confusing from several 
different angles:

 * I would not expect any cross-dc reads to normally be occurring
 * If there were cross-DC reads, they would take 50+ ms instead of the < 5 
ms normally reported
 * If the node that was restarted was still somehow involved in reads, 
its reporting shows it can only account for a small amount of the 
latency increase.


Some possible relevant configurations:
 * GossipingPropertyFileSnitch
 * dynamic_snitch_update_interval_in_ms: 100
 * dynamic_snitch_reset_interval_in_ms: 60
 * dynamic_snitch_badness_threshold: 0.1
 * read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (same 
type of behavior was observed with just read_repair_chance=0.1)


Has anyone else observed similar behavior and found a way to limit it? 
This seems like something that ought not to happen but without knowing 
why it is occurring I'm not sure how to stop it.




Re: row cache

2013-09-03 Thread Chris Burroughs

On 09/01/2013 03:06 PM, Faraaz Sareshwala wrote:

Yes, that is correct.

The SerializingCacheProvider stores row cache contents off heap. I believe you
need JNA enabled for this though. Someone please correct me if I am wrong here.

The ConcurrentLinkedHashCacheProvider stores row cache contents on the java heap
itself.



Naming things is hard.  Both caches are in memory and are backed by a 
ConcurrentLinkedHashMap.  In the case of the SerializingCacheProvider 
the *values* are stored in off-heap buffers.  Both must store a half 
dozen or so objects (on heap) per entry 
(org.apache.cassandra.cache.RowCacheKey, 
com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue, 
java.util.concurrent.ConcurrentHashMap$HashEntry, etc.).  It would 
probably be better to call this a "mixed-heap" rather than "off-heap" 
cache.  You may find the number of entries you can hold without GC 
problems to be surprisingly low (relative to, say, memcached, or physical 
memory on modern hardware).
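That per-entry bookkeeping can be put in rough numbers. A back-of-envelope sketch (the six-objects-per-entry and ~64-bytes-per-object figures are illustrative assumptions, not measured Cassandra internals):

```python
# Rough on-heap cost of the bookkeeping objects behind an "off-heap" row
# cache.  OBJECTS_PER_ENTRY and BYTES_PER_OBJECT are guesses chosen only
# to illustrate the scaling; measure with jmap/MAT for real numbers.
OBJECTS_PER_ENTRY = 6    # RowCacheKey, WeightedValue, HashEntry, ...
BYTES_PER_OBJECT = 64    # header + fields + padding, very roughly

def on_heap_overhead_mb(entries):
    return entries * OBJECTS_PER_ENTRY * BYTES_PER_OBJECT / (1024 * 1024)

# Even with all values off heap, a million cached rows means millions of
# long-lived on-heap objects for the collector to trace.
print(round(on_heap_overhead_mb(1_000_000)))  # → 366
```

The point is that entry count, not just value bytes, drives GC pressure.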


Invalidating a column with SerializingCacheProvider invalidates the 
entire row while with ConcurrentLinkedHashCacheProvider it does not. 
SerializingCacheProvider does not require JNA.


Both also use memory estimation of the size (of the values only) to 
determine the total number of entries retained.  Estimating the size of 
the totally on-heap ConcurrentLinkedHashCacheProvider has historically 
been dicey since we switched from sizing in entries, and it has been 
removed in 2.0.0.


As said elsewhere in this thread, the utility of the row cache varies 
from "absolutely essential" to "source of numerous problems" depending 
on the specifics of the data model and request distribution.





multi-dc clusters with 'local' ips and no vpn

2013-06-17 Thread Chris Burroughs
Cassandra makes the totally reasonable assumption that the entire
cluster is in one routable address space.  We unfortunately had a
situation where:
 * nodes can talk to each other in the same dc on an internal address,
but not talk to each other over their external 1:1 NAT address.
 * nodes can talk to nodes in the other dc over the external address,
but there is no usable shared internal address space they can talk over

In case anyone else finds themselves in the same situation we have what
we think is a working solution in pre-production.  CASSANDRA-5630
handles the reconnect trick to prefer the local ip when in the same
DC.  And some iptables rules allow the local nodes to do the initial
gossiping with each other before that switch.

for each node in same dc:
  'iptables -t nat -A OUTPUT -j DNAT -p tcp --dst %s --dport 7000 -o eth0 --to-destination %s' % (ext_ip, local_ip)
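Expanded into a runnable sketch (the addresses, port 7000, and the eth0 interface are illustrative; the (external, internal) pairs would come from your own inventory):

```python
# Build one DNAT rule per same-DC peer: traffic aimed at a peer's external
# NAT address on the storage port gets rewritten to its internal address.
# All concrete values here are illustrative assumptions.
RULE = ('iptables -t nat -A OUTPUT -j DNAT -p tcp --dst %s --dport 7000 '
        '-o eth0 --to-destination %s')

def dnat_rules(same_dc_nodes):
    """same_dc_nodes: iterable of (external_nat_ip, internal_ip) pairs."""
    return [RULE % (ext_ip, local_ip) for ext_ip, local_ip in same_dc_nodes]

for rule in dnat_rules([('203.0.113.10', '10.0.0.10')]):
    print(rule)
```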


SurgeCon 2012

2012-09-05 Thread Chris Burroughs
Surge [1] is scalability focused conference in late September hosted in
Baltimore.  It's a pretty cool conference with a good mix of
operationally minded people interested in scalability, distributed
systems, systems level performance and good stuff like that.  You should
go! [2]

For those of you who like historical trivia, Mike Malone gave a
well-received Cassandra talk at the first SurgeCon in 2010 [3].

This year there is organized room for BoFs and such, with several
one-hour slots on Wednesday and Thursday evenings between 9 p.m. and
midnight.  Last year a few of us got together informally around
lunch time [4].

Interested in getting together again this year?  Think we have critical
mass for a BoF?

[1] http://omniti.com/surge/2012

[2] http://omniti.com/surge/2012/register

[3] http://omniti.com/surge/2010/speakers/mike-malone

[4]
http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%3c4e82140a.5070...@gmail.com%3E


Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Chris Burroughs
On 06/13/2012 01:00 PM, Yuki Morishita wrote:
 The above implementation and most of the other ones (including stream-lib) 
 implement the optimized version of the algorithm which counts up to 10^9, so 
 may need some work.
 
 Other alternative is self-learning bitmap 
 (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my 
 understanding, is more memory efficient when counting small values.

The closest we could get to one-size-fits-all would probably be an
adaptive counting scheme that uses linear counting (or the self-learning
bitmap, didn't know about that one!) for small expected cardinalities
and a LogLog variant for higher ones.  It's more choices to make, but
choosing between "not too big" and "really, really big" doesn't seem like
an unreasonable burden to me.
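To make the small-cardinality half concrete, here is a minimal linear-counting sketch (the MD5 hash and 1024-slot bitmap are arbitrary assumptions; real implementations such as stream-lib are considerably more careful):

```python
import hashlib
import math

def linear_count(items, m=1024):
    """Estimate distinct count from the fraction of empty buckets in an
    m-slot bitmap; accurate while n is small relative to m."""
    bitmap = [False] * m
    for x in items:
        h = int(hashlib.md5(str(x).encode()).hexdigest(), 16) % m
        bitmap[h] = True
    empty = bitmap.count(False)
    if empty == 0:
        raise ValueError('bitmap saturated; switch to a LogLog variant')
    return m * math.log(m / empty)

# 100 distinct URLs, each seen twice, still estimate close to 100.
urls = ['http://example.com/%d' % i for i in range(100)] * 2
print(round(linear_count(urls)))
```

The saturation branch is exactly where an adaptive scheme would hand off to a LogLog variant.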


Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Chris Burroughs
Well, I obviously think it would be handy.  If this gets proposed and
ends up using stream-lib, don't be shy about asking for help.

On a more general note, it would be great to see the special case
Counter code become more general atomic operation code.

On 06/13/2012 01:15 PM, Utku Can Topçu wrote:
 Hi Yuki,
 
 I think I should have used the word discussion instead of proposal for the
 mailing subject. I have quite some of a design in my mind but I think it's
 not yet ripe enough to formalize. I'll try to simplify it and open a Jira
 ticket.
 But first I'm wondering if there would be any excitement in the community
 for such a feature.
 
 Regards,
 Utku
 
 On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita mor.y...@gmail.com wrote:
 
 You can open JIRA ticket at
 https://issues.apache.org/jira/browse/CASSANDRA with your proposal.

 Just for the input:

 I had once implemented HyperLogLog counter to use internally in Cassandra,
 but it turned out I didn't need it so I just put it to gist. You can find
 it here: https://gist.github.com/2597943

 The above implementation and most of the other ones (including stream-lib)
 implement the optimized version of the algorithm which counts up to 10^9,
 so may need some work.

 Other alternative is self-learning bitmap (
 http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
 understanding, is more memory efficient when counting small values.

 Yuki

 On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:

 Hi All,

 Let's assume we have a use case where we need to count the number of
 columns for a given key. Let's say the key is the URL and the column-name
 is the IP address or any cardinality identifier.

 The straight forward implementation seems to be simple, just inserting the
 IP Adresses as columns under the key defined by the URL and using get_count
 to count them back. However the problem here is in case of large rows
 (where too many IP addresses are in); the get_count method has to
 de-serialize the whole row and calculate the count. As also defined in the
 user guides, it's not an O(1) operation and it's quite costly.

 However, this problem seems to have better solutions if you don't have a
 strict requirement for the count to be exact. There are streaming
 algorithms that will provide good cardinality estimations within a
 predefined failure rate, I think the most popular one seems to be the
 (Hyper)LogLog algorithm, also there's an optimal one developed recently,
 please check http://dl.acm.org/citation.cfm?doid=1807085.1807094

 If you want to take a look at the Java implementation for LogLog,
 Clearspring has both LogLog and space optimized HyperLogLog available at
 https://github.com/clearspring/stream-lib

 I don't see a reason why this can't be implemented in Cassandra. The
 distributed nature of all these algorithms can easily be adapted to
 Cassandra's model. I think most of us would love to see come cardinality
 estimating columns in Cassandra.

 Regards,
 Utku



 



Re: Row caching in Cassandra 1.1 by column family

2012-06-18 Thread Chris Burroughs
Check out the rows_cached CF attribute.

On 06/18/2012 06:01 PM, Oleg Dulin wrote:
 Dear distinguished colleagues:
 
 I don't want all of my CFs cached, but one in particular I do.
 
 How can I configure that ?
 
 Thanks,
 Oleg
 



Re: 1.0.3 CLI oddities

2011-12-11 Thread Chris Burroughs
Sounds like https://issues.apache.org/jira/browse/CASSANDRA-3558 and the
other tickets reference there.

On 11/28/2011 05:05 AM, Janne Jalkanen wrote:
 Hi!
 
 (Asked this on IRC too, but didn't get anyone to respond, so here goes...)
 
 Is it just me, or are these real bugs? 
 
 On 1.0.3, from CLI: "update column family XXX with gc_grace = 36000;" just 
 says "null" with nothing logged.  Previous value is the default.
 
 Also, on 1.0.3, update column family XXX with 
 compression_options={sstable_compression:SnappyCompressor,chunk_length_kb:64};
  returns Internal error processing system_update_column_family and log says 
 Invalid negative or null chunk_length_kb (stack trace below)
 
 Setting the compression options worked on 1.0.0 when I was testing (though my 
 64 kB became 64 MB, but I believe this was fixed in 1.0.3.)
 
 Did the syntax change between 1.0.0 and 1.0.3? Or am I doing something wrong? 
 
 The database was upgraded from 0.6.13 to 1.0.0, then scrubbed, then 
 compression options set to some CFs, then upgraded to 1.0.3 and trying to set 
 compression on other CFs.
 
 Stack trace:
 
 ERROR [pool-2-thread-68] 2011-11-28 09:59:26,434 Cassandra.java (line 4038) 
 Internal error processing system_update_column_family
 java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
 java.io.IOException: org.apache.cassandra.config.ConfigurationException: 
 Invalid negative or null chunk_length_kb
   at 
 org.apache.cassandra.thrift.CassandraServer.applyMigrationOnStage(CassandraServer.java:898)
   at 
 org.apache.cassandra.thrift.CassandraServer.system_update_column_family(CassandraServer.java:1089)
   at 
 org.apache.cassandra.thrift.Cassandra$Processor$system_update_column_family.process(Cassandra.java:4032)
   at 
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
   at 
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:680)
 Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
 org.apache.cassandra.config.ConfigurationException: Invalid negative or null 
 chunk_length_kb
   at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
   at java.util.concurrent.FutureTask.get(FutureTask.java:83)
   at 
 org.apache.cassandra.thrift.CassandraServer.applyMigrationOnStage(CassandraServer.java:890)
   ... 7 more
 Caused by: java.io.IOException: 
 org.apache.cassandra.config.ConfigurationException: Invalid negative or null 
 chunk_length_kb
   at 
 org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:78)
   at org.apache.cassandra.db.migration.Migration.apply(Migration.java:156)
   at 
 org.apache.cassandra.thrift.CassandraServer$2.call(CassandraServer.java:883)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   ... 3 more
 Caused by: org.apache.cassandra.config.ConfigurationException: Invalid 
 negative or null chunk_length_kb
   at 
 org.apache.cassandra.io.compress.CompressionParameters.validateChunkLength(CompressionParameters.java:167)
   at 
 org.apache.cassandra.io.compress.CompressionParameters.create(CompressionParameters.java:52)
   at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:796)
   at 
 org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:74)
   ... 7 more
 ERROR [MigrationStage:1] 2011-11-28 09:59:26,434 AbstractCassandraDaemon.java 
 (line 133) Fatal exception in thread Thread[MigrationStage:1,5,main]
 java.io.IOException: org.apache.cassandra.config.ConfigurationException: 
 Invalid negative or null chunk_length_kb
   at 
 org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:78)
   at org.apache.cassandra.db.migration.Migration.apply(Migration.java:156)
   at 
 org.apache.cassandra.thrift.CassandraServer$2.call(CassandraServer.java:883)
   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:680)
 Caused by: org.apache.cassandra.config.ConfigurationException: Invalid 
 negative or null chunk_length_kb
   at 
 org.apache.cassandra.io.compress.CompressionParameters.validateChunkLength(CompressionParameters.java:167)
   at 
 org.apache.cassandra.io.compress.CompressionParameters.create(CompressionParameters.java:52)
   at 

Re: Second Cassandra users survey

2011-11-14 Thread Chris Burroughs
 - It would be super cool if all of that counter work made it possible
to support other atomic data types (sets? CAS? just pass an
associative/commutative function to apply).
 - Again with types, pluggable type specific compression.
 - Wishy-washy wish: simpler elasticity.  I would like to go from
6 to 8 to 7 nodes without each of those being an annoying fight with tokens.
 - Gossip as a library.  Gossip/failure detection is something C* seems to
have gotten particularly right (or at least it's something that has not
needed to change much).  It would be cool to use Cassandra's gossip
protocol as a distributed-systems building tool a la ZooKeeper.

On 11/01/2011 06:59 PM, Jonathan Ellis wrote:
 Hi all,
 
 Two years ago I asked for Cassandra use cases and feature requests.
 [1]  The results [2] have been extremely useful in setting and
 prioritizing goals for Cassandra development.  But with the release of
 1.0 we've accomplished basically everything from our original wish
 list. [3]
 
 I'd love to hear from modern Cassandra users again, especially if
 you're usually a quiet lurker.  What does Cassandra do well?  What are
 your pain points?  What's your feature wish list?
 
 As before, if you're in stealth mode or don't want to say anything in
 public, feel free to reply to me privately and I will keep it off the
 record.
 
 [1] 
 http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01148.html
 [2] 
 http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg01446.html
 [3] http://www.mail-archive.com/dev@cassandra.apache.org/msg01524.html
 



Re: CMS GC initial-mark taking 6 seconds , bad?

2011-10-20 Thread Chris Burroughs
On 10/20/2011 09:38 AM, Maxim Potekhin wrote:
 I happen to have 48GB on each machines I use in the cluster. Can I
 assume that I can't really use all of this memory productively? Do you
 have any suggestion related to that? Can I run more than one instance on
 Cassandra on the same box (using different ports) to take advantage of
 this memory, assuming the disk has enough bandwidth?

You are likely to not have good luck with a JVM heap that large.  But
you can:
 - Leave all that memory to the OS page cache.
 - mmap index files
 - use an off heap cache

All of those are productive uses.


Re: ApacheCon meetup?

2011-10-12 Thread Chris Burroughs
On 10/11/2011 12:05 PM, Eric Evans wrote:
 Let's do it.  We can organize an official one, and still grab food
 together if that's not enough. :)

Great!  Thanks for putting this together.


ApacheCon meetup?

2011-10-04 Thread Chris Burroughs
ApacheCon NA is coming up next month.  I suspect there will be at least
a few Cassandra users there (yeah new release!).  Would anyone be
interested in getting together and sharing some stories?  This could
either be a official [1] meetup.  Or grabbing food together sometime.

[1] http://wiki.apache.org/apachecon/ApacheMeetupsNa11


Re: Surgecon Meetup?

2011-09-27 Thread Chris Burroughs
So it sounds like there are about a half dozen of us, some coming
Wednesday, others Thursday.  I'll have some Cassandra eye logos out
around lunch both of those days.  If that herds us together then
success!  If not I'll try something more formal.

Looking forward to meeting everyone.

On 09/25/2011 07:27 PM, Chris Burroughs wrote:
 Surge [1] is a scalability-focused conference in late September hosted in
 Baltimore.  It's a pretty cool conference with a good mix of
 operationally minded people interested in scalability, distributed
 systems, systems level performance and good stuff like that.  You should
 go! [2]
 
 Anyway, I'll be there, and if any other Cassandra users are
 coming I'm happy to help herd us towards meeting up, lunch, hacking,
 etc.  I *think* there might be some time for structured BoF type
 sessions as well.
 
 
 [1] http://omniti.com/surge/2011
 
 [2] Actually tickets recently sold out, you should go in 2012!



Surgecon Meetup?

2011-09-25 Thread Chris Burroughs
Surge [1] is a scalability-focused conference in late September hosted in
Baltimore.  It's a pretty cool conference with a good mix of
operationally minded people interested in scalability, distributed
systems, systems level performance and good stuff like that.  You should
go! [2]

Anyway, I'll be there, and if any other Cassandra users are
coming I'm happy to help herd us towards meeting up, lunch, hacking,
etc.  I *think* there might be some time for structured BoF type
sessions as well.


[1] http://omniti.com/surge/2011

[2] Actually tickets recently sold out, you should go in 2012!


Re: cassandra server disk full

2011-07-29 Thread Chris Burroughs
On 07/25/2011 01:53 PM, Ryan King wrote:
 Actually I was wrong – our patch will disable gossip and thrift but
 leave the process running:
 
 https://issues.apache.org/jira/browse/CASSANDRA-2118
 
 If people are interested in that I can make sure its up to date with
 our latest version.

Thanks Ryan.

/me expresses interest.

Zombie nodes when the file system does something interesting are not fun.


Re: Survey: Cassandra/JVM Resident Set Size increase

2011-07-29 Thread Chris Burroughs
Thanks to everyone who responded (I think I learned a few new tricks
from seeing what you tried and how you monitor).  I didn't see any
patterns in JVM, OS, cassandra versions etc.

At this time I'm confident in saying CASSANDRA-2868 (and thus really
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7066129) is the culprit.

On 07/12/2011 09:28 AM, Chris Burroughs wrote:
 ### Preamble
 
 There have been several reports on the mailing list of the JVM running
 Cassandra using too much memory.  That is, the resident set size is
 greater than (max java heap size + mmapped segments) and continues to grow until the
 process swaps, kernel oom killer comes along, or performance just
 degrades too far due to the lack of space for the page cache.  It has
 been unclear from these reports if there is a pattern.  My hope here is
 that by comparing JVM versions, OS versions, JVM configuration etc., we
 will find something.  Thank you everyone for your time.
 
 
 Some example reports:
  - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
  -
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html
  - https://issues.apache.org/jira/browse/CASSANDRA-2868
  -
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html
  -
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html
 
 For reference theories include (in no particular order):
  - memory fragmentation
  - JVM bug
  - OS/glibc bug
  - direct memory
  - swap induced fragmentation
  - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity.
 
 ### Survey
 
 1. Do you think you are experiencing this problem?
 
 2.  Why? (This is a good time to share a graph like
 http://www.twitpic.com/5fdabn or
 http://img24.imageshack.us/img24/1754/cassandrarss.png)
 
 2. Are you using mmap? (If yes be sure to have read
 http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have
 used pmap [or another tool] to rule out mmap and top deceiving you.)
 
 3. Are you using JNA?  Was mlockall successful (it's in the logs on startup)?
 
 4. Is swap enabled? Are you swapping?
 
 5. What version of Apache Cassandra are you using?
 
 6. What is the earliest version of Apache Cassandra you recall seeing
 this problem with?
 
 7. Have you tried the patch from CASSANDRA-2654 ?
 
 8. What jvm and version are you using?
 
 9. What OS and version are you using?
 
 10. What are your jvm flags?
 
 11. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)
 
 12. Can you characterise how much GC your cluster is doing?
 
 13. Approximately how many read/writes per unit time is your cluster
 doing (per node or the whole cluster)?
 
 14.  How are you column families configured (key cache size, row cache
 size, etc.)?
 



Re: JNA to avoid swap but physical memory increase

2011-07-15 Thread Chris Burroughs
On 07/15/2011 07:24 AM, Daniel Doubleday wrote:
 Also our experience shows that the jna call does not prevent swapping so the 
 general advice is disable swap.

Can you confirm you don't get the (paraphrasing) whoops we tried
mlockall but ulimits denied us message on startup?


Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-14 Thread Chris Burroughs
On 07/13/2011 03:57 PM, Aaron Morton wrote:
 You can always use a dedicated CF for the counters, and use the same row key.

Of course one could do this.  The problem is you are now spending ~2x
disk space on row keys, and app specific client code just became more
complicated.


Survey: Cassandra/JVM Resident Set Size increase

2011-07-12 Thread Chris Burroughs
### Preamble

There have been several reports on the mailing list of the JVM running
Cassandra using too much memory.  That is, the resident set size is
greater than (max java heap size + mmapped segments) and continues to grow until the
process swaps, kernel oom killer comes along, or performance just
degrades too far due to the lack of space for the page cache.  It has
been unclear from these reports if there is a pattern.  My hope here is
that by comparing JVM versions, OS versions, JVM configuration etc., we
will find something.  Thank you everyone for your time.
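[Editorial aside: one way to collect comparable numbers over time is to sample VmRSS from /proc directly. A minimal Linux-only sketch; the pid you pass and the sampling interval are up to you.]

```python
import os
import re

def vm_rss_kb(status_text):
    """Parse the resident set size (kB) out of the text of a
    /proc/<pid>/status file."""
    m = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(m.group(1)) if m else None

# Sample our own process; for a Cassandra node you would read
# /proc/<jvm pid>/status once per interval and log the value.
path = f"/proc/{os.getpid()}/status"
if os.path.exists(path):
    with open(path) as f:
        print(vm_rss_kb(f.read()))
```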


Some example reports:
 - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
 -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html
 - https://issues.apache.org/jira/browse/CASSANDRA-2868
 -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html
 -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html

For reference theories include (in no particular order):
 - memory fragmentation
 - JVM bug
 - OS/glibc bug
 - direct memory
 - swap induced fragmentation
 - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity.

### Survey

1. Do you think you are experiencing this problem?

2.  Why? (This is a good time to share a graph like
http://www.twitpic.com/5fdabn or
http://img24.imageshack.us/img24/1754/cassandrarss.png)

2. Are you using mmap? (If yes be sure to have read
http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have
used pmap [or another tool] to rule out mmap and top deceiving you.)

3. Are you using JNA?  Was mlockall successful (it's in the logs on startup)?

4. Is swap enabled? Are you swapping?

5. What version of Apache Cassandra are you using?

6. What is the earliest version of Apache Cassandra you recall seeing
this problem with?

7. Have you tried the patch from CASSANDRA-2654 ?

8. What jvm and version are you using?

9. What OS and version are you using?

10. What are your jvm flags?

11. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)

12. Can you characterise how much GC your cluster is doing?

13. Approximately how many read/writes per unit time is your cluster
doing (per node or the whole cluster)?

14.  How are you column families configured (key cache size, row cache
size, etc.)?



Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-11 Thread Chris Burroughs
On 07/10/2011 01:09 PM, Aditya Narayan wrote:
 Is there any target version in near future for which this has been promised
 ?

The ticket is problematic in that it would -- unless someone has a
clever new idea -- require breaking thrift compatibility to add it to
the api.  This is unfortunate, since it would be so useful.

If it's in the 0.8.x series it will only be through CQL.


Re: Cassandra DC Upcoming Meetup

2011-07-05 Thread Chris Burroughs
On 06/15/2011 08:57 AM, Chris Burroughs wrote:
 Cassandra DC's first meetup of the pizza and talks variety will be on
 July 6th. There will be an introductory sort of presentation and a
 totally cool one on Pig integration.
 
 If you are in the DC area it would be great to see you there.
 
 http://www.meetup.com/Cassandra-DC-Meetup/events/22145481/

My totally anecdotal impression from going to several Big
Data/Hadoop/JUG meetups in the DC area  is that there is a reasonable
amount of interest, but not a large amount of production use.  In other
words, this is a great time to bring along your Cassandra Curious
friends and co-workers! Hope to see some of you tomorrow.

Chris Burroughs


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 10:03 PM, Edward Capriolo wrote:
 I have not read the original thread concerning the problem you mentioned.
 One way to avoid OOM is large amounts of RAM :) On a more serious note most
 OOM's are caused by setting caches or memtables too large. If the OOM was
 caused by a software bug, the cassandra devs are on the ball and move fast.
 I still suggest not jumping into a release right away. 

For what it's worth, that particular thread was about the kernel oom
killer, which is a good example of the kind of gotcha that has caused
several people to chime in with the importance of monitoring both
Cassandra and the OS.


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 07:12 PM, Les Hazlewood wrote:
 Telling me to read the mailing lists and follow the issue tracker and use
 monitoring software is all great and fine - and I do all of these things
 today already - but this is a philosophical recommendation that does not
 actually address my question.  So I chalk this up as an error on my side in
 not being clear in my question - my apologies.  Let me reformulate it :)

For what it's worth that was intended as a concrete suggestion.  We
adopted Cassandra a year ago when (IMHO) it would have been a mistake to do so
without the willingness to develop sufficient in house expertise to
internally patch/fork/debug if needed.  Things are more mature now, best
practices more widespread etc., but you should judge that yourself.

In the spirit of your re-formulated questions:
 - Read-before-write is a Cassandra anti-pattern, avoid it if at all
possible.
 - Those optional lines in the env script about GC logging?  Uncomment
them on at least some of your boxes.
 - use MLOCKALL+mmap, or standard io, but not mmap without MLOCKALL.


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/23/2011 01:56 PM, Les Hazlewood wrote:
 Is there a roadmap or time to 1.0?  Even a ballpark time (e.g next year 3rd
 quarter, end of year, etc) would be great as it would help me understand
 where it may lie in relation to my production rollout.


The C* devs are rather strongly inclined against putting too much
meaning in version numbers.  The next major release might be called 1.0.
Or maybe it won't.  Either way it won't be different code or support
from something called 0.9 or 10.0.

September 8th is the feature freeze for the next major release.


Re: BloomFilterFalsePositives equals 1.0

2011-06-22 Thread Chris Burroughs
To be precise, you made n requests for non-existent keys, got n negative
responses, and BloomFilterFalsePositives also went up by n?
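For reference, the textbook expected false-positive probability for a Bloom filter with m bits, k hash functions, and n inserted items is (1 - e^(-kn/m))^k, so a ratio pinned at exactly 1.0 suggests something other than ordinary false positives. A quick sketch of the math (generic Bloom-filter arithmetic, not Cassandra's actual sizing code):

```python
import math

def bloom_fp_rate(n_items, m_bits, k_hashes):
    """Theoretical false-positive probability of a Bloom filter with
    m bits, k hash functions, and n inserted items."""
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

# ~1M keys at 10 bits per key with 7 hashes
rate = bloom_fp_rate(1_000_000, 10_000_000, 7)
print(f"{rate:.4f}")  # -> 0.0082
```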

On 06/21/2011 11:06 PM, Preston Chang wrote:
 Hi,all:
  I have a problem with bloom filter. When made a test which tried to get
 some nonexistent keys, it seemed that the bloom filter does not work. The
 'BloomFilterFalseRatio' was 1.0 and the 'BloomFilterFalsePositives' was
 rising and the disk I/O utils reached 100% according to 'iostat'.
 
 I found the patch in
 https://issues.apache.org/jira/browse/CASSANDRA-2637 , but in my cluster key
 cache had been enabled already.  My Cassandra version is 0.7.3. There are 3
 nodes and RF is 3.
 
 Thanks for your help.
 



Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Chris Burroughs
On 06/22/2011 08:53 AM, Sasha Dolgy wrote:
 Yes ... this is because it was the OS that killed the process, and
 wasn't related to Cassandra crashing.  Reviewing our monitoring, we
 saw that memory utilization was pegged at 100% for days and days
 before it was finally killed because 'apt' was fighting for resource.
 At least, that's as far as I got in my investigation before giving up,
 moving to 0.8.0 and implementing 24hr nodetool repair on each node via
 cronjobs.  So far ... no problems.

In `free` terms, by pegged do you mean that free Mem was 0, or -/+
buffers/cache was 0?


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Chris Burroughs
On 06/22/2011 05:33 PM, Les Hazlewood wrote:
 Just to be clear:
 
 I understand that resources like [1] and [2] exist, and I've read them.  I'm
 just wondering if there are any 'gotchas' that might be missing from that
 documentation that should be considered and if there are any recommendations
 in addition to these documents.
 
 Thanks,
 
 Les
 
 [1] http://www.datastax.com/docs/0.8/operations/index
 [2] http://wiki.apache.org/cassandra/Operations
 

Well, if they knew some secret gotcha, the dutiful Cassandra operators of
the world would update the wiki.

The closest thing to a 'gotcha' is that neither Cassandra nor any other
technology is going to get you those nines.  Humans will need to commit
to reading the mailing lists, following JIRA, and understanding what the
code is doing.  And humans will need to commit to combining that
understanding with monitoring and alerting to figure out all of the 'it
depends' for your particular case.


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Chris Burroughs
Do all of the reductions in Used on that graph correspond to node restarts?

My Zabbix for reference: http://img194.imageshack.us/img194/383/2weekmem.png


On 06/22/2011 06:35 PM, Sasha Dolgy wrote:
 http://www.twitpic.com/5fdabn
 http://www.twitpic.com/5fdbdg
 
 i do love a good graph.  two of the weekly memory utilization graphs
 for 2 of the 4 servers from this ring... week 21 was a nice week ...
 the week before 0.8.0 went out proper.  since then, bumped up to 0.8
 and have seen a steady increase in the memory consumption (used) but
 have not seen the swap do what it did ...and the buffered/cached seems
 much better
 
 -sd
 
 On Thu, Jun 23, 2011 at 12:09 AM, Chris Burroughs
 chris.burrou...@gmail.com wrote:

 In `free` terms, by pegged do you mean that free Mem was 0, or -/+
 buffers/cache was 0?



Cassandra DC Upcoming Meetup

2011-06-15 Thread Chris Burroughs
Cassandra DC's first meetup of the pizza and talks variety will be on
July 6th. There will be an introductory sort of presentation and a
totally cool one on Pig integration.

If you are in the DC area it would be great to see you there.

http://www.meetup.com/Cassandra-DC-Meetup/events/22145481/


Re: Data directories

2011-06-09 Thread Chris Burroughs
On 06/08/2011 05:54 AM, Héctor Izquierdo Seliva wrote:
 Is there a way to control what sstables go to what data directory? I
 have a fast but space limited ssd, and a way slower raid, and i'd like
 to put latency sensitive data into the ssd and leave the other data in
 the raid. Is this possible? If not, how well does cassandra play with
 symlinks?
 

Another option would be to use the ssd as a block level cache with
something like flashcache https://github.com/facebook/flashcache/.


Re: Index interval tuning

2011-05-11 Thread Chris Burroughs
On 05/10/2011 10:24 PM, aaron morton wrote:
 What version and what were the values for RecentBloomFilterFalsePositives and 
 BloomFilterFalsePositives ?
 
 The bloom filter metrics are updated in SSTableReader.getPosition() the only 
 slightly odd thing I can see is that we do not count a key cache hit as a true
 positive for the bloom filter. If there were a lot of key cache hits and a 
 few false positives the ratio would be wrong. I'll ask around, does not seem 
 to apply to Hector's case though.

0.7.1  No key cache.

BloomFilterFalsePositives: 48130
Read Count: 153973494
RecentBloomFilterFalsePositives: 4, 1, 2, 0, 0, 1


Re: Index interval tuning

2011-05-10 Thread Chris Burroughs
On 05/10/2011 02:12 PM, Peter Schuller wrote:
 That reminds me, my false positive ratio is stuck at 1.0, so I guess
 bloom filters aren't doing a lot for me.
 
 That sounds unlikely unless you're hitting some edge case like reading
 a particular row that happened to be a collision, and only that row.
 This is from JMX stats on the column family store?
 

(From jmx)  I also see BloomFilterFalseRatio stuck at 1.0 on my
production nodes.  The only values that RecentBloomFilterFalseRatio had
over the past several minutes were 0.0 and 1.0.  While I can't prove
that isn't accurate, it is very suspicious.

The code looked reasonable until I got to SSTableReader, which was too
complicated to just glance through.


Re: Native heap leaks?

2011-05-05 Thread Chris Burroughs
On 2011-05-05 06:30, Hannes Schmidt wrote:
 This was my first thought, too. We switched to mmap_index_only and
 didn't see any change in behavior. Looking at the smaps file attached
 to my original post, one can see that the mmapped index files take up
 only a minuscule part of RSS.

I have not looked into smaps before.  But it actually seems odd that the
mmapped index files are taking up so *little* memory.  Are they only a
few kb on disk?  Is this a snapshot taken shortly after the process
started, or from just before the OOM killer presumably comes along?  How
long does it take to go from 1.1 G to 2.1 G resident?  Either way, it
would be worthwhile to set one node to standard io to make sure it's
really not mmap causing the problem.

Anyway, assuming it's not mmap, here are the other similar threads on
the topic.  Unfortunately none of them claim an obvious solution:

http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
http://www.mail-archive.com/user@cassandra.apache.org/msg08063.html
http://www.mail-archive.com/user@cassandra.apache.org/msg12036.html
http://mail.openjdk.java.net/pipermail/hotspot-dev/2011-April/004091.html


Cassandra Meetup in DC

2011-05-02 Thread Chris Burroughs
http://www.meetup.com/Cassandra-DC-Meetup/


*What*: First Cassandra DC Meetup

*When*: Thursday, May 12, 2011 at 6:30 PM

*Where*: Northside Social Coffee & Wine - 3211 Wilson Blvd Arlington, VA


I'm pleased to announce the first Cassandra DC Meetup
http://www.meetup.com/Cassandra-DC-Meetup/events/17207138/. Come have
a drink, meet your fellow members, talk about Apache Cassandra, discuss
Greek mythological prophets, and what you want out of the group.


flashcache experimentation

2011-04-18 Thread Chris Burroughs
https://github.com/facebook/flashcache/

FlashCache is a general purpose writeback block cache for Linux.

We have a case where:
 - Access to data is not uniformly random (let's say Zipfian).
 - The hot set > RAM.
 - Size of disk is such that buying enough SSDs, fast drives, multiple
drives, etc would be undesirable.

This seems like a good case for flashcache.  However, as far as I can
tell from searching no one has tried this and posted any results.  I was
wondering if anyone has tried flashcache in a similar situation with
Cassandra and if so how the experience went.


Re: CL.ONE reads / RR / badness_threshold interaction

2011-04-12 Thread Chris Burroughs
On 04/12/2011 06:27 PM, Peter Schuller wrote:
 So to increase pinny-ness I'll further reduce RR chance and set a
 badness threshold.  Thanks all.
 
 Just be aware that, assuming I am not missing something, while this
 will indeed give you better cache locality under normal circumstances
 - once that closest node does go down, traffic will then go to a
 node which will have potentially zero cache hit rate on that data
 since all reads up to that point were taken by the node that just went
 down.
 
 So it's not an obvious win depending.


Yeah, there is less-than-great behaviour when nodes are restarted or
otherwise go down with this configuration.  Probably still preferable
for my current situation.  Others' mileage may vary.


http://img27.imageshack.us/img27/85/cacherestart.png


Re: quick repair tool question

2011-04-12 Thread Chris Burroughs
On 04/12/2011 11:11 AM, Jonathan Colby wrote:
 I'm not sure if this is the kosher way to rebuild the sstable data, but it 
 seemed to work.  

http://wiki.apache.org/cassandra/Operations#Handling_failure

Option #3.



Analysing hotspot gc logs

2011-04-11 Thread Chris Burroughs
To avoid taking my own thread [1] off on a tangent: does anyone have a
recommendation for a tool to do graphical analysis (i.e. make useful graphs)
of hotspot gc logs?  Google searches have turned up several results
along the lines of "go try this zip file" [2].
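Absent a proper tool, one crude option is to pull the stop-the-world pause durations out of the log yourself and feed them to gnuplot or a spreadsheet. A minimal sketch (it assumes the classic hotspot format of that era, where each collection line ends in "..., N.NNNNNNN secs]"; real logs have more line variants):

```python
import re

# A pause duration printed by hotspot, e.g. ", 0.0123456 secs]"
PAUSE_RE = re.compile(r"([0-9]+\.[0-9]+) secs\]")

def pause_times(lines):
    """Extract pause durations (seconds) from hotspot GC log lines."""
    out = []
    for line in lines:
        for m in PAUSE_RE.finditer(line):
            out.append(float(m.group(1)))
    return out

log = [
    "4.231: [GC 123456K->45678K(1048576K), 0.0123456 secs]",
    "9.812: [Full GC 45678K->40000K(1048576K), 1.2345678 secs]",
]
print(pause_times(log))  # -> [0.0123456, 1.2345678]
```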

[1] http://www.mail-archive.com/user@cassandra.apache.org/msg12134.html

[2]
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2009-August/000420.html


Re: Minor Follow-up: reduced cached mem; resident set size growth

2011-04-08 Thread Chris Burroughs
On 04/05/2011 03:04 PM, Chris Burroughs wrote:

 I have gc logs if anyone is interested.

This is from a node with standard io, jna enabled, but limits were not
set for mlockall to succeed.  One can see -/+ buffers/cache free
shrinking and the C* pid's RSS growing.


Includes several days of:
gc log
free -s
/proc/$PID/status

http://www.filefactory.com/file/ca94892/n/04-08.tar.gz

Please enjoy!  (If there is a preferred way to share the tarball let me
know.)


Re: CL.ONE reads / RR / badness_threshold interaction

2011-04-07 Thread Chris Burroughs
Peter, thank you for the extremely detailed reply.

To now answer my own question, the critical points that are different
from what I said earlier are: that CL.ONE does prefer *one* node (which
one depending on snitch) and that RR uses digests (which are not
mentioned on the wiki page [1]) instead of comparing raw requests.
Totally tangential, but in the case of CL.ONE with narrow rows, making
the request to all replicas and taking the fastest response would probably
be better, but having things work both ways depending on row size sounds painfully
complicated.  (As Aaron points out this is not how things work now.)

I am assuming that RR digests save on bandwidth, but to generate the
digest with a row cache miss the same number of disk seeks are required
(my nemesis is disk io).

So to increase pinny-ness I'll further reduce RR chance and set a
badness threshold.  Thanks all.


[1] http://wiki.apache.org/cassandra/ReadRepair


Re: Minor Follow-up: reduced cached mem; resident set size growth

2011-04-06 Thread Chris Burroughs
On 04/05/2011 04:38 PM, Peter Schuller wrote:
 - Different collectors: -XX:+UseParallelGC -XX:+UseParallelOldGC
 
 Unless you also removed the -XX:+UseConcMarkSweepGC I *think* it takes
 precedence, so that the above options would have no effect. I didn't
 test. In either case, did you definitely confirm CMS was no longer
 being used? (Should be pretty obvious if you ran with
 -XX:+PrintGCDetails which looks plenty different w/o CMS)
 

More precisely, I did this:

# GC tuning options
#JVM_OPTS=$JVM_OPTS -XX:+UseParNewGC
#JVM_OPTS=$JVM_OPTS -XX:+UseConcMarkSweepGC
#JVM_OPTS=$JVM_OPTS -XX:+CMSParallelRemarkEnabled
#JVM_OPTS=$JVM_OPTS -XX:SurvivorRatio=8
#JVM_OPTS=$JVM_OPTS -XX:MaxTenuringThreshold=1
#JVM_OPTS=$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75
#JVM_OPTS=$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly
JVM_OPTS=$JVM_OPTS -XX:+UseParallelGC
JVM_OPTS=$JVM_OPTS -XX:+UseParallelOldGC


 I have gc logs if anyone is interested.
 
 Yes :)



By "have gc logs" I meant I had them until I accidentally blew them away
while restarting a server.  Will post them in a day or two when there
is a reasonable amount of data, or the quantum state collapses and the
problem vanishes when it is observed.


 [1] http://img194.imageshack.us/img194/383/2weekmem.png
 
 I did go back and revisit the old thread... maybe I'm missing
 something, but just to be real sure:
 
 What does the no color/white mean on this graph? Is that application
 memory (resident set)?
 
 I'm not really sure what I'm looking for since you already said you
 tested with 'standard' which rules out the
 resident-set-memory-as-a-result-of-mmap being counted towards the
 leak. But still.
 

I will be the first to admit that Zabbix's graphs are not the... easiest
to read.  My interpretation is that no color is none of the above
and by being unavailable is thus in use by applications.  This fits with
what I see with free and measurements of the RSS of the jvm from /proc/.
 I'll leave free -s going for a few days while waiting on the gc logs as
an extra sanity test.  That's probably easier to reason about anyway.


CL.ONE reads / RR / badness_threshold interaction

2011-04-06 Thread Chris Burroughs
My understanding of CL.ONE, for the node that receives the request:

(A) If RR is enabled and this node contains the needed row -- return
immediately and do RR to remaining replicas in background.
(B) If RR is off and this node contains the needed row -- return the
needed data immediately.
(C) If this node does not have the needed row -- regardless of RR ask
all replicas and return the first result.


However case (C) as I have described it does not allow for any notion of
'pinning' as mentioned for dynamic_snitch_badness_threshold:

# if set greater than zero and read_repair_chance is < 1.0, this will allow
# 'pinning' of replicas to hosts in order to increase cache capacity.
# The badness threshold will control how much worse the pinned host has
# to be before the dynamic snitch will prefer other replicas over it.  This is
# expressed as a double which represents a percentage.  Thus, a value of
# 0.2 means Cassandra would continue to prefer the static snitch values
# until the pinned host was 20% worse than the fastest.
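
My reading of that comment as a rule (illustrative only, not the actual dynamic snitch code; "scores" here are latencies, where lower is better):

```python
def prefer_pinned(pinned_latency, best_latency, badness_threshold):
    """Stick with the pinned replica unless it is more than
    `badness_threshold` (a fraction) worse than the fastest replica."""
    return pinned_latency <= best_latency * (1.0 + badness_threshold)

print(prefer_pinned(1.15, 1.0, 0.2))  # pinned host only 15% worse -> True
print(prefer_pinned(1.30, 1.0, 0.2))  # 30% worse than the fastest -> False
```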


The wiki states CL.ONE "Will return the record returned by the first
replica to respond" [1], implying that the request goes to multiple
replicas, but datastax's docs state that only one node will receive the
request ("Returns the response from *the* closest replica, as determined
by the snitch configured for the cluster" [2]).

Could someone clarify how CL.ONE reads with RR off work?


[1] http://wiki.apache.org/cassandra/API
[2]
http://www.datastax.com/docs/0.7/consistency/index#choosing-consistency-levels
 emphasis added


Re: IndexInterval Tuning

2011-04-05 Thread Chris Burroughs
On 04/05/2011 09:57 AM, Jonathan Ellis wrote:
 On Tue, Apr 5, 2011 at 8:54 AM, Jonathan Ellis jbel...@gmail.com wrote:
 Adjusting indexinterval is unlikely to be useful on very narrow rows.
 (Its purpose is to make random access to _large_ rows doable.)
 
 Whoops, that's column_index_size_in_kb.
 
 I'd play w/ keycache before index_interval personally.  (If you can
 get 100% key cache hit rate it doesn't really matter what index
 interval is, as long as you can still build the cache effectively.)


I've already tried a key cache equal to and larger (up to what I have
heap space for) than my current row cache.  But for very narrow rows the
row cache is empirically and theoretically better.

I realise changing IndexInterval is an unusual proposed configuration,
but such is the burden of high cardinality narrow rows.
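
To make the trade-off concrete, index_interval-style sampling amounts to
keeping every Nth key in memory and paying a bounded scan per lookup.
An illustration only, not Cassandra's code:

```python
import bisect

def sampled_lookup(sorted_keys, target, interval=128):
    # Keep every `interval`-th key as the in-memory index sample; a
    # lookup binary-searches the sample, then scans at most `interval`
    # "on-disk" entries from the nearest sampled position.
    sample = sorted_keys[::interval]
    i = bisect.bisect_right(sample, target) - 1
    if i < 0:
        return None
    start = i * interval
    for pos in range(start, min(start + interval, len(sorted_keys))):
        if sorted_keys[pos] == target:
            return pos
    return None
```

A larger interval shrinks the sample (memory) but lengthens the scan,
which is why a 100% key cache hit rate makes the interval irrelevant.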


Minor Follow-up: reduced cached mem; resident set size growth

2011-04-05 Thread Chris Burroughs
This is a minor followup to this thread which includes required context:

http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html

I haven't solved the problem, but since negative results can also be
useful I thought I would share them.  Things I tried unsuccessfully (on
individual nodes except for the upgrade):

- Upgrade from Cassandra 0.6 to 0.7
- Different collectors: -XX:+UseParallelGC -XX:+UseParallelOldGC
- JNA (but not mlockall)
- Switch disk_access_mode from standard to mmap_index_only (obviously in
this case RSS is less than useful, but the overall memory graph still
looked bad, like this [1]).


On #cassandra there was speculation that a large (200k) row cache may be
inducing heap fragmentation.  I have not ruled this out, but have been
unable to reproduce it in stand-alone ConcurrentLinkedHashMap stress
testing.  Since turning off the row cache would be a cure worse than the
disease, I have not tried that yet with a real cluster.

Future possibilities would be to get the limits set right for mlockall,
trying combinations of the above, and running without caches.

I have gc logs if anyone is interested.

[1] http://img194.imageshack.us/img194/383/2weekmem.png


Re: How to determine if repair need to be run

2011-03-30 Thread Chris Burroughs
On 03/29/2011 01:18 PM, Peter Schuller wrote:
 (What *would* be useful perhaps is to be able to ask a node for the
 time of its most recently started repair, to facilitate easier
 comparison with GCGraceSeconds for monitoring purposes.)

I concur.  JIRA time?

(Perhaps keeping track of the same thing for major compactions would
also be useful?)


Re: On 0.6.6 to 0.7.3 migration, DC-aware traffic and minimising data transfer

2011-03-14 Thread Chris Burroughs
On 03/11/2011 03:46 PM, Jonathan Ellis wrote:
 Repairs is not yet WAN-optimized but is still cheap if your replicas
 are close to consistent since only merkle trees + inconsistent ranges
 are sent over the network.
 

What is the ticket number for WAN optimized repair?


Re: cassandra in-production experiences with .7 series

2011-03-07 Thread Chris Burroughs
On 03/05/2011 05:27 PM, Paul Pak wrote:
 Hello all,
 
 I was wondering if people could share their overall experiences with
 using .7 series of Cassandra in production?  Is anyone using it?
 

For what it's worth, we are using a dozen-node 0.7.x cluster and have not
had any major problems (our use cases dodged most of the less pleasant
bugs).  This replaced a smaller 0.6.x cluster that we were not happy with.

Whether the new code really helped (the main feature we wanted was mx4j,
due to idiosyncratic features of our monitoring system) or not, we didn't
have time to experimentally determine.


Re: Reducing memory footprint

2011-03-07 Thread Chris Burroughs
On 03/04/2011 03:51 PM, Casey Deccio wrote:
 Are you saying: that you want a smaller heap and what settings to change
 to accommodate that, or that you have already set a small heap of x and
 Cassandra is using significantly more than that?

 
 Based on my observation above, the latter.
 
 Casey
 

As Aaron said, the first things to look at are your jvm settings,
jvm version, and io configuration (standard v mmap).


You may also wish to read this thread:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/reduced-cached-mem-resident-set-size-growth-td5967110.html


Re: Reducing memory footprint

2011-03-04 Thread Chris Burroughs
On 03/04/2011 01:53 PM, Casey Deccio wrote:
 I have a small ring of cassandra nodes that have somewhat limited memory
 capacity for the moment.  Cassandra is eating up all the memory on these
 nodes.  I'm not sure where to look first in terms of reducing the foot
 print.  Keys cached?  Compaction?
 
 Any hints would be greatly appreciated.
 
 Regards,
 Casey
 

What do you mean by eating up the memory?  Resident set size, low
memory available to page cache, excessive gc of the jvm's heap?

Are you saying: that you want a smaller heap and what settings to change
to accommodate that, or that you have already set a small heap of x and
Cassandra is using significantly more than that?


Re: OOM exceptions

2011-03-04 Thread Chris Burroughs
- Does this occur only during compaction or at seemingly random times?
- How large is your heap?  What jvm settings are you using? How much
physical RAM do you have?
- Do you have the row and/or key cache enabled?  How are they
configured?  How large are they when the OOM is thrown?

On 03/04/2011 02:38 PM, Mark Miller wrote:
 Other than adding more memory to the machine is there a way to solve
 this? Please help. Thanks
 
 ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java
 (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception
 in thread Thread[COMPACTION-POOL:1,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2798)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
 at
 org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)
 at
 org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)
 
 at
 org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)
 
 at
 org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)
 
 at
 org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
 
 at
 org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)
 
 at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)
 
 at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
 
 at
 org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
 
 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
 
 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
 
 at
 org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
 
 at
 org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
 
 at
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)
 
 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)
 
 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 
 at java.lang.Thread.run(Thread.java:636)
 



Re: OOM exceptions

2011-03-04 Thread Chris Burroughs
See also:
http://www.datastax.com/docs/0.7/troubleshooting/index#nodes-are-dying-with-oom-errors

On 03/04/2011 03:05 PM, Chris Burroughs wrote:
 - Does this occur only during compaction or at seemingly random times?
 - How large is your heap?  What jvm settings are you using? How much
 physical RAM do you have?
 - Do you have the row and/or key cache enabled?  How are they
 configured?  How large are they when the OOM is thrown?
 
 On 03/04/2011 02:38 PM, Mark Miller wrote:
 Other than adding more memory to the machine is there a way to solve
 this? Please help. Thanks

 ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java
 (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception
 in thread Thread[COMPACTION-POOL:1,5,main]
 java.lang.OutOfMemoryError: Java heap space
 at java.util.Arrays.copyOf(Arrays.java:2798)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
 at java.io.DataOutputStream.write(DataOutputStream.java:107)
 at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
 at
 org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)
 at
 org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)

 at
 org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)

 at
 org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)

 at
 org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)

 at
 org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)

 at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)

 at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)

 at
 org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)

 at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)

 at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)

 at
 org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)

 at
 org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)

 at
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)

 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)

 at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)
 at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
 at java.util.concurrent.FutureTask.run(FutureTask.java:166)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)

 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)

 at java.lang.Thread.run(Thread.java:636)

 



Re: OOM exceptions

2011-03-04 Thread Chris Burroughs
- Are you using a key cache?  How many keys do you have?  Across how
many column families?

Your configuration is unusual both in terms of not setting min heap ==
max heap and the percentage of available RAM used for the heap.  Did you
change the heap size in response to errors or for another reason?

On 03/04/2011 03:25 PM, Mark wrote:
 This happens during compaction and we are not using the RowsCached
 attribute.
 
 Our initial/max heap are 2 and 6 respectively and we have 8 gigs in
 these machines.
 
 Thanks
 
 On 3/4/11 12:05 PM, Chris Burroughs wrote:
 - Does this occur only during compaction or at seemingly random times?
 - How large is your heap?  What jvm settings are you using? How much
 physical RAM do you have?
 - Do you have the row and/or key cache enabled?  How are they
 configured?  How large are they when the OOM is thrown?

 On 03/04/2011 02:38 PM, Mark Miller wrote:
 Other than adding more memory to the machine is there a way to solve
 this? Please help. Thanks

 ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java
 (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception
 in thread Thread[COMPACTION-POOL:1,5,main]
 java.lang.OutOfMemoryError: Java heap space
  at java.util.Arrays.copyOf(Arrays.java:2798)
  at
 java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
  at java.io.DataOutputStream.write(DataOutputStream.java:107)
  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
  at
 org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)

  at
 org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)


  at
 org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)


  at
 org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)


  at
 org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)


  at
 org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)


  at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)


  at
 org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)


  at
 org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)


  at
 com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)


  at
 com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)


  at
 org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)


  at
 org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)


  at
 org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)


  at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)


  at
 org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)

  at
 java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
  at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)


  at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)


  at java.lang.Thread.run(Thread.java:636)

 



Re: Column name size

2011-02-11 Thread Chris Burroughs
On 02/11/2011 05:06 AM, Patrik Modesto wrote:
 Hi all!
 
 I'm thinking if size of a column name could matter for a large dataset
 in Cassandra  (I mean lots of rows). For example what if I have a row
 with 10 columns each has 10 bytes value and 10 bytes name. Do I have
 half the row size just of the column names and the other half of the
 data (not counting storage overhead)?  What if I have 10M of these
 rows? Is there a difference? Should I use some 3bytes codes for a
 column name to save memory/bandwidth?
 
 Thanks,
 Patrik

You are correct that for small rows/column values the key or name itself
can represent a large proportion of the total size.  I think you will
find the consensus on this list is that trying to be clever with names
is usually not worth the additional complexity.

The right solution to this is
https://issues.apache.org/jira/browse/CASSANDRA-47.
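
The arithmetic in the question is worth making explicit; a quick sketch
(raw payload only, storage overhead ignored as in the question):

```python
def name_fraction(n_cols, name_bytes, value_bytes):
    # Fraction of the raw row payload taken by column names alone.
    names = n_cols * name_bytes
    return names / float(names + n_cols * value_bytes)

# 10 columns with 10-byte names and 10-byte values: names are half the
# payload.  The ratio is per-row, so 10M rows changes the totals but
# not the proportion.
assert name_fraction(10, 10, 10) == 0.5
# 3-byte codes would cut the name share to ~23%:
assert round(name_fraction(10, 3, 10), 3) == 0.231
```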


Re: Out of control memory consumption

2011-02-09 Thread Chris Burroughs
On 02/09/2011 11:15 AM, Huy Le wrote:
 There is already an email thread on memory issue on this email list, but I
 creating a new thread as we are experiencing a different memory consumption
 issue.
 
 We are a 12-server cluster.  We use random partitioner with manually generated
 server tokens.  Memory usage on one server keeps growing out of control.  We
 ran flush, cleared key and row caches, and ran GC, but heap memory
 usage won't go down.  The only way to get heap memory usage to go down is to
 restart Cassandra.  We have to do this once a day.  All other servers have
 heap memory usage less than 500MB.  This issue happened on both Cassandra
 0.6.6 and 0.6.11.
 

If heap usage continues to grow, an OOM will eventually be thrown.
Are you experiencing OOMs on these boxes?  If you are not OOMing, then
what problem are you experiencing (excessive CPU use from garbage
collection, for example)?



 Our JVM info:
 
 java version 1.6.0_21
 Java(TM) SE Runtime Environment (build 1.6.0_21-b06)
 Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode)
 
 And JVM memory allocation:  -Xms3G -Xmx3G
 
 Non-heap memory usage is 138MB.
 
 Any recommendation where should look to see why memory usage keep growing?
 
 Thanks!
 
 Huy

Are you using standard, mmap_index_only, or mmap io?  Are you using JNA?



Re: Default Listen Port

2011-02-09 Thread Chris Burroughs
On 02/09/2011 04:00 PM, jeremy.truel...@barclayscapital.com wrote:
 What's the easiest way to change the port nodes listen for comm on
 from other nodes? It appears that the default is 8080 which collides
 with my tomcat server on one of our dev boxes. I tried doing
 something in cassandra.yaml like
 
 listen_address: 192.1.fake.2:
 
 but that doesn't work it throws an exception. Also can you not put
 the actual name of servers in the config or does it always have to be
 the actual ip address currently? Thanks.
 


8080 is used by jmx [1].  You can change that in cassandra-env.sh.

hostnames are allowed.


[1] http://wiki.apache.org/cassandra/FAQ#ports


Re: OOM during batch_mutate

2011-02-08 Thread Chris Burroughs
On 02/07/2011 06:05 PM, Jonathan Ellis wrote:
 Sounds like the keyspace was created on the 32GB machine, so it
 guessed memtable sizes that are too large when run on the 16GB one.
 Use update column family from the cli to cut the throughput and
 operations thresholds in half, or to 1/4 to be cautious.


This guessing is new in 0.7.x, right?  On 0.6.x, a storage-conf.xml +
sstables can be moved among machines with different amounts of RAM
without needing to change anything through the cli?



Re: CF Read and Write Latency Histograms

2011-02-07 Thread Chris Burroughs
On 02/04/2011 12:43 PM, Jonathan Ellis wrote:
 Can you create a ticket?


I noticed the same thing. CASSANDRA-2123 created.


Re: 0.7.0 mx4j, get attribute

2011-02-03 Thread Chris Burroughs
On 02/02/2011 01:41 PM, Ryan King wrote:
 On Wed, Feb 2, 2011 at 10:40 AM, Chris Burroughs
 chris.burrou...@gmail.com wrote:
 I'm using 0.7.0 and experimenting with the new mx4j support.

 http://host:port/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage

 Returns a nice pretty html page.  For purposes of monitoring I would
 like to get a single attribute as xml.  The docs [1] decribe a
 getattribute endpoint.  But I have been unable to get anything other
 than a blank response from that.  mx4j does not seem to include any
 logging for troubleshooting.

 Example:
 http://host:port/getattribute?objectname=org.apache.cassandra.request%3atype%3dReadStage&attribute=PendingTasks

 returns 200 OK with no data.

 If anyone could point out what embarrassingly simple mistake I am making
 I would be much obliged.


 [1] http://mx4j.sourceforge.net/docs/ch05.html

 
 Note that many objects in cassandra aren't initialized until they're
 used for the first time.
 
 -ryan

But if I can access them through jconsole just fine I don't see what
would be stopping mx4j.


Re: 0.7.0 mx4j, get attribute

2011-02-03 Thread Chris Burroughs
On 02/03/2011 11:29 AM, Ran Tavory wrote:
 Try adding this to the end of the URL: ?template=identity
 


That works, thanks!
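
For anyone else polling mx4j this way, the working request can be wrapped
up.  A sketch; the host, port, and mbean names below are placeholders,
while the getattribute endpoint and the ?template=identity workaround
come from this thread:

```python
from urllib.parse import quote
from urllib.request import urlopen

def mx4j_attribute_url(host, port, objectname, attribute):
    # template=identity returns raw XML instead of the empty styled
    # response described above.
    return ("http://%s:%d/getattribute?objectname=%s&attribute=%s"
            "&template=identity"
            % (host, port, quote(objectname, safe=""), attribute))

def mx4j_attribute(host, port, objectname, attribute):
    # Requires a live mx4j HTTP adaptor to actually fetch anything.
    return urlopen(mx4j_attribute_url(host, port, objectname, attribute)).read()
```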


Re: reduced cached mem; resident set size growth

2011-02-02 Thread Chris Burroughs
On 01/28/2011 09:19 PM, Chris Burroughs wrote:
 Thanks Oleg and Zhu.  I swear that wasn't a new hotspot version when I
 checked, but that's obviously not the case.  I'll update one node to the
 latest as soon as I can and report back.


RSS over 48 hours with java 6 update 23:

http://img716.imageshack.us/img716/5202/u2348hours.png

I'll continue monitoring, but RSS still appears to grow without bound.
Zhu reported a similar problem with Ubuntu 10.04.  While possible, it
would seem extraordinarily unlikely that there is a glibc or kernel
bug affecting us both.


