Re: CQL Composite Key Seen After Table Creation
On 01/06/2016 04:47 PM, Robert Coli wrote:
> On Wed, Jan 6, 2016 at 12:54 PM, Chris Burroughs <chris.burrou...@gmail.com> wrote:
>> The problem with that approach is that manually editing the local schema tables in a live cluster is wildly dangerous. I *think* this would work:
>> * Make triple sure no schema changes are happening on the cluster.
>> * Update schema tables on each node --> drain --> restart
>
> I think that would work too, and probably be lower risk than modifying on one node and trying to get the others to pull via resetlocalschema. But I agree it seems "wildly dangerous".

We did this, and a day later it appears successful. I am still fuzzy on how schema "changes" propagate when you edit the schema tables directly, and am unsure if the drain/restart rain dance was strictly necessary, but it felt safer. (Obviously, even if I were sure now, that would not be behavior to count on, and I hope not to need to do this again.)
Re: CQL Composite Key Seen After Table Creation
I work with Amir, and after further experimentation I can shed a little more light on what exactly is going on under the hood. For background, our goal is to take data that is currently being read and written via thrift, switch reads to CQL, and then switch writes to CQL. This is an alternative to deleting all of our data and starting over, or being forever stuck on super old thrift clients (both of those options obviously suck). The data models involved are absurdly simple (a single key with a handful of static columns).

TLDR: Metadata is complicated. What is the least dangerous way to make direct changes to system.schema_columnfamilies and system.schema_columns?

Anyway, given some super simple Foo and Bar column families:

create keyspace Test
  with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy'
  and strategy_options = {replication_factor:1};
use Test;
create column family Foo
  with comparator = UTF8Type
  and key_validation_class = UTF8Type
  and column_metadata = [{column_name: title, validation_class: UTF8Type}];
create column family Bar
  with comparator = UTF8Type
  and key_validation_class = UTF8Type;
update column family Bar
  with column_metadata = [{column_name: title, validation_class: UTF8Type}];

(The salient difference, as described by Amir, is when the column_metadata is set: at creation time, or later.)

Now we can inject a little data and see that from thrift everything looks fine:

[default@Test] set Foo['testkey']['title']='mytitle';
Value inserted.
Elapsed time: 19 msec(s).
[default@Test] set Bar['testkey']['title']='mytitle';
Value inserted.
Elapsed time: 4.47 msec(s).
[default@Test] list Foo;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: testkey
=> (name=title, value=mytitle, timestamp=1452108082972000)

1 Row Returned.
Elapsed time: 268 msec(s).
[default@Test] list Bar;
Using default limit of 100
Using default cell limit of 100
-------------------
RowKey: testkey
=> (name=title, value=mytitle, timestamp=1452108093739000)

1 Row Returned.
Elapsed time: 9.3 msec(s).

But from CQL the Bar column family does not look like the data we wrote:

cqlsh> select * from "Test"."Foo";

 key     | title
---------+---------
 testkey | mytitle

(1 rows)

cqlsh> select * from "Test"."Bar";

 key     | column1 | value            | title
---------+---------+------------------+---------
 testkey | title   | 0x6d797469746c65 | mytitle

It's not just that these phantom columns are ugly; CQL thinks column1 is part of a composite primary key. Since there **is no column1**, that renders the data un-query-able with WHERE clauses.

Just to make sure it's not thrift that is doing something unexpected, the sstables show the expected structure:

$ ./tools/bin/sstable2json /data/sstables/data/Test/Foo-d3348860b4af11e5b456639406f48f1b/Test-Foo-ka-1-Data.db
[
{"key": "testkey",
 "cells": [["title","mytitle",1452110466924000]]}
]

So what appeared to be an innocent variation made years ago, when the thrift schema was written, causes very different results in CQL.
Digging into the schema tables shows what is going on in more detail:

> select keyspace_name,columnfamily_name,column_aliases,comparator,is_dense,key_aliases,value_alias
> from system.schema_columnfamilies where keyspace_name='Test';

 keyspace_name | columnfamily_name | column_aliases | comparator                               | is_dense | key_aliases | value_alias
---------------+-------------------+----------------+------------------------------------------+----------+-------------+-------------
 Test          | Bar               | ["column1"]    | org.apache.cassandra.db.marshal.UTF8Type | True     | ["key"]     | value
 Test          | Foo               | []             | org.apache.cassandra.db.marshal.UTF8Type | False    | ["key"]     | null

> select keyspace_name,columnfamily_name,column_name,validator
> from system.schema_columns where keyspace_name='Test';

 keyspace_name | columnfamily_name | column_name | validator
---------------+-------------------+-------------+--------------------------------------------
 Test          | Bar               | column1     | org.apache.cassandra.db.marshal.UTF8Type
 Test          | Bar               | key         | org.apache.cassandra.db.marshal.UTF8Type
 Test          | Bar               | title       | org.apache.cassandra.db.marshal.UTF8Type
 Test          | Bar               | value       | org.apache.cassandra.db.marshal.BytesType
 Test          | Foo               | key         | org.apache.cassandra.db.marshal.UTF8Type
 Test          | Foo               | title       | org.apache.cassandra.db.marshal.UTF8Type

Now the interesting bit is that the metadata can be manually "fixed": UPDATE
Re: Migration 1.2.14 to 2.0.8 causes Tried to create duplicate hard link at startup
Were you able to solve or work around this problem?

On 06/05/2014 11:47 AM, Tom van den Berge wrote:

Hi,

I'm trying to migrate a development cluster from 1.2.14 to 2.0.8. When starting up 2.0.8, I'm seeing the following error in the logs:

 INFO 17:40:25,405 Snapshotting drillster, Account to pre-sstablemetamigration
ERROR 17:40:25,407 Exception encountered during startup
java.lang.RuntimeException: Tried to create duplicate hard link to /Users/tom/cassandra-data/data/drillster/Account/snapshots/pre-sstablemetamigration/drillster-Account-ic-65-Filter.db
	at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:75)
	at org.apache.cassandra.db.compaction.LegacyLeveledManifest.snapshotWithoutCFS(LegacyLeveledManifest.java:129)
	at org.apache.cassandra.db.compaction.LegacyLeveledManifest.migrateManifests(LegacyLeveledManifest.java:91)
	at org.apache.cassandra.db.compaction.LeveledManifest.maybeMigrateManifests(LeveledManifest.java:617)
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)

Does anyone have an idea how to solve this?

Thanks,
Tom
Re: New node Unable to gossip with any seeds
This generally means that the seed node's address, as the seed itself uses it, doesn't exactly match how it is listed in the new node's seeds list. CASSANDRA-6523 has some links that might be helpful.

On 05/26/2014 12:07 AM, Tim Dunphy wrote:

Hello,

I am trying to spin up a new node using cassandra 2.0.7. Both nodes are at Digital Ocean. The seed node is up and running and I can telnet to port 7000 on that host from the node I'm trying to start.

[root@cassandra02 apache-cassandra-2.0.7]# telnet 10.10.1.94 7000
Trying 10.10.1.94...
Connected to 10.10.1.94.
Escape character is '^]'.

But when I start cassandra on the new node I see the following exception:

 INFO 00:01:34,744 Handshaking version with /10.10.1.94
ERROR 00:02:05,733 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
	at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)
	at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)
	at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)
java.lang.RuntimeException: Unable to gossip with any seeds
	at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)
	at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)
	at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)
Exception encountered during startup: Unable to gossip with any seeds
ERROR 00:02:05,742 Exception in thread Thread[StorageServiceShutdownHook,5,main]
java.lang.NullPointerException
	at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1270)
	at org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:573)
	at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
	at java.lang.Thread.run(Thread.java:745)

I'm using the murmur3 partitioner on both nodes and I have the seed node's IP listed in the cassandra.yaml of the new node. I'm just wondering what the issue might be and how I can get around it.

Thanks
Tim
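To make the "described in the same way" point concrete, here is a sketch of the relevant cassandra.yaml fragment on the joining node, using the IP from this thread (the class name is the standard SimpleSeedProvider; everything else about the file is omitted):

```yaml
# cassandra.yaml on the joining node (sketch; address taken from this thread).
# The seeds entry must match the address the seed node actually listens on
# and advertises -- mixing a hostname here with an IP on the seed (or vice
# versa) is a classic cause of "Unable to gossip with any seeds".
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.10.1.94"
```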
Re: alternative vnode upgrade strategy?
On 05/28/2014 02:18 PM, William Oberman wrote:

1.) Upgrade all N nodes to vnodes in place
Start loop
2.) Boot a new node and let it bootstrap
3.) Decommission an old node
End loop

It's been a while since I had to think about the vnode migration, but I think this would fall prey to https://issues.apache.org/jira/browse/CASSANDRA-5525
Re: Is the tarball for a given release in a Maven repository somewhere?
Maven central has bin.tar.gz and src.tar.gz downloads for the 'apache-cassandra' artifact. Does that work for your use case? http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22apache-cassandra%22

On 05/20/2014 05:30 PM, Clint Kelly wrote:

Hi all,

I am using the maven assembly plugin to build a project that contains a development environment for a project that we've built at work on top of Cassandra. I'd like this development environment to include the latest release of Cassandra. Is there a maven repo anywhere that contains an artifact with the Cassandra release in it? I'd like to have the same Cassandra tarball that you can download from the website be a dependency for my project. I can then have the assembly plugin untar it and customize some of the conf files before tarring up our entire development environment. That way, anyone using our development environment would have access to the various shell scripts and tools.

I poked around online and could not find what I was looking for. Any help would be appreciated!

Best regards,
Clint
Re: What does the rate signify for latency in the JMX Metrics?
They are exponentially decaying moving averages (like Unix load averages) of the number of events per unit of time. http://wiki.apache.org/cassandra/Metrics might help.

On 04/17/2014 06:06 PM, Redmumba wrote:

Good afternoon,

I'm attempting to integrate the metrics generated via JMX into our internal framework; however, the information for several of the metrics includes a One/Five/Fifteen-minute rate, with the RateUnit in SECONDS. For example:

$ get -b org.apache.cassandra.metrics:name=Latency,scope=Write,type=ClientRequest *
#mbean = org.apache.cassandra.metrics:name=Latency,scope=Write,type=ClientRequest:
LatencyUnit = MICROSECONDS;
EventType = calls;
RateUnit = SECONDS;
MeanRate = 383.6944837362387;
FifteenMinuteRate = 868.8420188648543;
FiveMinuteRate = 817.5239450236011;
OneMinuteRate = 675.7673129014964;
Max = 498867.0;
Count = 31257426;
Min = 52.0;
50thPercentile = 926.0;
Mean = 1063.114029159023;
StdDev = 1638.1542477604232;
75thPercentile = 1064.75;
95thPercentile = 1304.55;
98thPercentile = 1504.39992;
99thPercentile = 2307.35104;
999thPercentile = 10491.8502;

What does the rate signify in this context? For example, given the OneMinuteRate of 675.7673129014964 and the unit of seconds, what is this measuring? Is this the rate at which metrics are submitted? i.e., were there an average of (676 * 60 seconds) metrics submitted over the last minute?

Thanks!
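To make the "exponentially decaying moving average" answer concrete, here is a minimal Python sketch in the spirit of Unix load averages and the Metrics library's Meter. This is an illustration, not the library's actual code; the 5-second tick interval matches the library's convention, and the class and method names here are made up:

```python
import math

class OneMinuteRate:
    """Sketch of an exponentially weighted moving average (EWMA) rate.

    `rate` is in events per second, which is why a OneMinuteRate can be
    a number like 675.77 with RateUnit = SECONDS."""
    TICK_INTERVAL = 5.0  # seconds between ticks

    def __init__(self):
        # Standard EWMA smoothing constant for a one-minute window.
        self.alpha = 1.0 - math.exp(-self.TICK_INTERVAL / 60.0)
        self.uncounted = 0
        self.rate = None  # events per second

    def mark(self, n=1):
        """Record n events (e.g. n client writes)."""
        self.uncounted += n

    def tick(self):
        """Called every TICK_INTERVAL seconds by a scheduler."""
        instant = self.uncounted / self.TICK_INTERVAL
        self.uncounted = 0
        if self.rate is None:
            self.rate = instant  # seed with the first observation
        else:
            self.rate += self.alpha * (instant - self.rate)

# A steady 600 events per 5-second tick converges to 120 events/sec.
m = OneMinuteRate()
for _ in range(100):
    m.mark(600)
    m.tick()
print(m.rate)
```

So a OneMinuteRate of ~676 with RateUnit = SECONDS means roughly 676 write requests per second, smoothed over about the last minute; it is not a count of metrics submitted.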
Re: Backup procedure
It's also good to note that only the Data files are compressed already. Depending on your data, the Index and other files may be a significant percentage of total on-disk data.

On 05/02/2014 01:14 PM, tommaso barbugli wrote:

In my tests, compressing sstables with lzop (with cassandra compression turned on) resulted in approx. 50% smaller files. That's probably because the chunks of data compressed by lzop are way bigger than the average size of writes performed on Cassandra (not sure how data is compressed but I guess it is done per single cell so unless one stores)

2014-05-02 19:01 GMT+02:00 Robert Coli <rc...@eventbrite.com>:

On Fri, May 2, 2014 at 2:07 AM, tommaso barbugli <tbarbu...@gmail.com> wrote:

If you are thinking about using Amazon S3 storage, I wrote a tool that performs snapshots and backups on multiple nodes. Backups are stored compressed on S3. https://github.com/tbarbugli/cassandra_snapshotter

https://github.com/JeremyGrosser/tablesnap

SSTables in Cassandra are compressed by default; if you are re-compressing them you may just be wasting CPU.. :)

=Rob
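The chunk-size effect described above is easy to demonstrate. This sketch uses Python's zlib as a stand-in for the actual compressors involved (Cassandra's chunked on-disk compression, lzop over whole files); the payload and chunk sizes are hypothetical, chosen only to show that compressing many small chunks independently yields a worse total than one large window:

```python
import zlib

# Hypothetical, highly repetitive payload standing in for a column
# family full of similar rows (~4 MB).
payload = b"rowkey:title=mytitle;" * 200000

def compressed_size(data, chunk_size):
    """Compress each chunk independently, as a chunked store does,
    and return the total compressed size in bytes."""
    return sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )

chunked = compressed_size(payload, 64 * 1024)   # small independent chunks
whole = compressed_size(payload, len(payload))  # one big window, as a
                                                # whole-file compressor sees
print(chunked, whole)
```

Each chunk pays its own header overhead and cannot reference redundancy in other chunks, so the whole-file pass comes out smaller, which is consistent with lzop shaving another ~50% off already-compressed sstables.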
Re: row caching for frequently updated column
You are close.

On 04/30/2014 12:41 AM, Jimmy Lin wrote:

thanks all for the pointers. let me see if I can put the sequence of events together:

1.2: people misunderstand/misuse the row cache, in that cassandra caches the entire row of data even if you are only looking for a small subset of the row. e.g. "select single_column from a_wide_row_table" will result in the entire row being cached even if you are only interested in one single column of the row.

Yep!

2.0: because of the potential misuse of heap memory, Cassandra 2.0 removed the heap cache and only supports the off-heap cache, which has the side effect that a write will invalidate the row cache (my original question).

"off-heap" is a common but misleading name for the SerializingCacheProvider. It still stores several objects on heap per cached item and has to deserialize on read.

2.1: the coming 2.1 Cassandra will offer a true cache-by-query, so the cached data will be much more efficient even for wide rows (it caches what it needs).

do I get it right? for the new 2.1 row caching, is it still true that a write or update to the row will invalidate the cached row?

I don't think "true cache by query" is an accurate description of CASSANDRA-5357. I think it's more like a "head of the row" cache.
Re: Thrift Server Implementations
On 02/13/2014 01:37 PM, Christopher Wirt wrote:

Anyway, today I moved the old HsHa implementation and the new TThreadedSelectorServer into a 2.0.5 checkout, hooked them in, built, did a bit of testing, and I'm now running live. We found the TThreadedSelectorServer performed the best, getting us back under our SLA.

Are you still running with the upstream TThreadedSelectorServer? Based on your experience, is there any reason Cassandra should not adopt it?
Re: mixed nodes, some SSD some HD
No. If you have a heterogeneous cluster you should consider adjusting the number of vnodes per physical node.

On 03/04/2014 10:47 PM, Elliot Finley wrote:

Using Cassandra 2.0.x, if I have a 3 node cluster where 2 of the nodes use spinning drives and 1 uses an SSD, will the majority of reads be routed to the SSD node automatically because it has faster responses?

TIA,
Elliot
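To sketch what "adjusting the number of vnodes per physical node" means: expected data ownership is proportional to each node's num_tokens, so a faster node can be given more vnodes. The node names and token counts below are hypothetical (256 is the usual vnode default):

```python
# Hypothetical three-node cluster matching the question: two spinning-disk
# nodes at the default 256 tokens, the SSD node given double.
num_tokens = {"hdd1": 256, "hdd2": 256, "ssd1": 512}

total = sum(num_tokens.values())
# Expected fraction of the ring (and thus of the data) each node owns.
ownership = {node: n / total for node, n in num_tokens.items()}
print(ownership)
```

Note the mechanism: the SSD node serves more requests because it owns more of the data, not because the cluster routes reads toward faster responders.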
Re: ring describe returns only public ips
More generally, a thrift api or other mechanism for Astyanax to get the INTERNAL_IP seems necessary to use ConnectionPoolType.TOKEN_AWARE + NodeDiscoveryType.TOKEN_AWARE in a multi-dc setup. Absent one I'm confused how that combination is possible. On 02/06/2014 03:17 PM, Ted Pearson wrote: We are using Cassandra 1.2.13 in a multi-datacenter setup. We are using Astyanax as the client, and we’d like to enable its token aware connection pool type and ring describe node discovery type. Unfortunately, I’ve found that both thrift’s describe_ring and `nodetool ring` only report the public IPs of the cassandra nodes. This means that Astyanax tries to reconnect to the public IPs of each node, which doesn’t work and just results in no hosts being available for queries according to Astyanax. I know from `nodetool gossipinfo` (and the fact that the clusters work) that it's sharing the LOCAL_IP via gossip, but have no idea how or if it’s possible to get describe_ring to return local IPs, or if there is some alternative. Thanks, -Ted
Re: Question about local reads with multiple data centers
On 01/29/2014 08:07 PM, Donald Smith wrote:

My question: will the read process try to read first locally from the datacenter DC2 I specified in its connection string? I presume so. (I doubt that it uses the client's IP address to decide which datacenter is closer. And I am unaware of another way to tell it to read locally.)

From the rest of this thread it looks like you were asking about how the client selects a Cassandra node to act as a coordinator. Note, however, that if you are using a DC-oblivious CL (ONE, QUORUM) then that coordinator may send requests to the remote data center.

Also, will read repair happen between datacenters automatically (read_repair_chance=0.10)? Or does that only happen within a single data center?

Yes, read_repair_chance is global. There is a separate dclocal_read_repair_chance if you want to make local read repairs more common.
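For reference, here is roughly how those two knobs look on a CQL3 table. This is a sketch with a hypothetical keyspace/table name and example values, not something taken from this thread:

```sql
-- read_repair_chance picks replicas across all DCs;
-- dclocal_read_repair_chance only repairs within the coordinator's DC.
ALTER TABLE my_keyspace.my_table
  WITH read_repair_chance = 0.0
   AND dclocal_read_repair_chance = 0.1;
```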
Re: what tool will create noncql columnfamilies in cassandra 3a
On 02/05/2014 04:57 AM, Sylvain Lebresne wrote:

>> How will users adjust the metadata of non-cql column families?
>
> The rationale for removing cassandra-cli is mainly that maintaining 2 fully featured command line interfaces is a waste of the project's resources in the long run. It's just a tool using the thrift interface, however, and you'll still be able to adjust metadata through the thrift interface as before. As Patricia mentioned, there are even some existing interactive options like pycassaShell in the community.

It's also wasteful for the community to maintain multiple post-3.0 forks of cassandra-cli so they can continue using Cassandra. It would be more efficient if they could pool their resources in a central place, like a code repo at Apache.
Re: First SSTable file is not being compacted
On 02/06/2014 01:17 AM, Sameer Farooqui wrote:

I'm running C* 2.0.4, and when I have a handful of SSTable files and trigger a manual compaction with 'nodetool compact', the first SSTable file doesn't get compacted away. Is there something special about the first SSTable that it remains even after a SizeTiered compaction?

No, this is not expected behavior. Does the number of live SSTables reported match what is on disk? Do you have a procedure that can repeat this?
Re: First SSTable file is not being compacted
Sounds like you have done some solid test work. I suggest reading https://issues.apache.org/jira/browse/CASSANDRA-6568 and, if you think your issue is the same, adding your reproduction case there; otherwise create your own ticket.

On 02/06/2014 10:53 AM, Sameer Farooqui wrote:

Yeah, it's definitely repeatable. I have a lab environment set up where the issue is occurring; I've recreated the lab environment 4-5 times and it's occurred each time. In my demodb.users CF I currently have 2 data SSTables on disk (demodb-users-jb-1-Data.db and demodb-users-jb-6-Data.db). However, in OpsCenter the "CF: SSTable Count (demodb.users)" graph shows only one SSTable. The nodetool cfstats command also shows "SSTable count: 1" for this CF.

- SF
Re: Row cache vs. OS buffer cache
My experience has been that the row cache is much more effective. However, reasonable row cache sizes are so small relative to RAM that I don't see it as a significant trade-off unless it's a very memory constrained environment. If you want to enable the row cache (a big if), you probably want it to be as big as it can be until you reach the point of diminishing returns on the hit rate. The off-heap cache still has many on-heap objects, so it doesn't really change that much conceptually; you will just end up with a different number for the size.

On 01/23/2014 02:13 AM, Katriel Traum wrote:

Hello list,

I was wondering if anyone has any pointers or advice regarding using the row cache vs leaving it up to the OS buffer cache. I run cassandra 1.1 and 1.2 with JNA, so off-heap row cache is an option.

Any input appreciated.
Katriel
nodetool cleanup / TTL
This has not reached a consensus in #cassandra in the past. Does `nodetool cleanup` also remove data that has expired from a TTL?
Re: nodetool cleanup / TTL
On 01/07/2014 01:38 PM, Tyler Hobbs wrote:

On Tue, Jan 7, 2014 at 7:49 AM, Chris Burroughs <chris.burrou...@gmail.com> wrote:

This has not reached a consensus in #cassandra in the past. Does `nodetool cleanup` also remove data that has expired from a TTL?

No, cleanup only removes rows that the node is not a replica for.

Is there some other mechanism for forcing expired data to be removed without also compacting? (Major compaction has obvious problematic side effects, and user-defined compaction is significant work to script up.)
Re: vnode in production
On 01/02/2014 01:51 PM, Arindam Barua wrote:

1. the stability of vnodes in production

I'm happily using vnodes in production now, but I would have had trouble calling them stable for anything more than small clusters until very recently (1.2.13). CASSANDRA-6127 served as a master ticket for most of the issues, if you are interested in the details.

2. upgrading to vnodes in production

I am not aware of anyone who has succeeded with shuffle in production, but the 'add a new DC' procedure works.
Re: vnode in production
On 01/06/2014 01:56 PM, Arindam Barua wrote:

Thanks for your responses. We are on 1.2.12 currently. The fixes in 1.2.13 seem to help clusters in the 500+ node range (like CASSANDRA-6409). Ours is below 50 now, so we plan to go ahead and enable vnodes with the 'add a new DC' procedure. We will try to upgrade to 1.2.13 or 1.2.14 subsequently.

Your plan seems reasonable, but in the interest of full disclosure, CASSANDRA-6345 has been observed to be a significant issue for clusters in the 50-75 node range.
Re: How to measure data transfer between data centers?
https://wiki.apache.org/cassandra/Metrics has per-node Streaming metrics that include total bytes in/out. That is only a small piece of what you want, though. For total DC bandwidth it might be more straightforward to measure at the router/switch/fancy-network-gear level.

On 12/03/2013 06:25 AM, Tom van den Berge wrote:

Is there a way to know how much data is transferred between two nodes, or more specifically, between two data centers? I'm especially interested in how much data is being replicated from one data center to another, to know how much of the available bandwidth is used.

Thanks,
Tom
MiscStage Backup
I'm trying to debug a node that has a backup in MiscStage. Starting a bit under 24 hours ago, the number of pending tasks jumped to a bit under 400 and has hovered around there. It looks like repair requests from other nodes (tpstats on this node shows "AntiEntropySessions: 0, 0, 0", which I think indicates it did not originate the repair). After each MiscStage task completes, a series of Streams are kicked off.

I am confused why MiscStage is backing up: (A) this node has only been down a few hours over the past week, so it should not be wildly out of sync; (B) no other node in this cluster has had a comparable backup of pending MiscStage tasks. Repairs are run on all nodes once a week. Physical resources on this node are not particularly saturated compared to the rest of the cluster; reads are slower, but I can't tell cause from effect in that case.

Graph of MiscStage pending tasks: http://imgur.com/sHqHTvt

This is with a 1.2.11-ish dual-DC vnode cluster.

"MiscStage:1" daemon prio=10 tid=0x7f84e8598800 nid=0x43b2 waiting on condition [0x7f83c3734000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for <0x00069d23c700> (a java.util.concurrent.FutureTask$Sync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
	at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
	at java.util.concurrent.FutureTask.get(FutureTask.java:83)
	at org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:375)
	at org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:368)
	at org.apache.cassandra.streaming.StreamOut.flushSSTables(StreamOut.java:108)
	at org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:136)
	at org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:116)
	at org.apache.cassandra.streaming.StreamRequestVerbHandler.doVerb(StreamRequestVerbHandler.java:44)
	at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Re: Endless loop LCS compaction
On 11/07/2013 06:48 AM, Desimpel, Ignace wrote:

Total data size is only 3.5GB. Column family was created with SSTableSize: 10 MB.

You may want to try a significantly larger size. https://issues.apache.org/jira/browse/CASSANDRA-5727
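For reference, the LCS sstable size can be changed on an existing table via CQL. This is a sketch with a hypothetical keyspace/table name; 160 MB is the larger value discussed in CASSANDRA-5727, not something taken from this thread:

```sql
-- Raise the per-sstable target from the 10 MB used above.
ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy',
                     'sstable_size_in_mb': 160};
```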
Re: Why truncate previous hints when upgrade from 1.1.9 to 1.2.6?
NEWS.txt has some details and suggested procedures:

- The hints schema was changed from 1.1 to 1.2. Cassandra automatically snapshots and then truncates the hints column family as part of starting up 1.2 for the first time. Additionally, upgraded nodes will not store new hints destined for older (pre-1.2) nodes. It is therefore recommended that you perform a cluster upgrade when all nodes are up. Because hints will be lost, a cluster-wide repair (with -pr) is recommended after upgrade of all nodes.

On 11/07/2013 07:33 AM, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 wrote:

Hi all,

When I upgrade C* from 1.1.9 to 1.2.6, I notice that the previous hints columnfamily is directly truncated. Can you tell me why? Because consistency is important to my services.

Best Regards,
Boole Guo
Re: Cassandra 1.1.6 - New node bootstrap not completing
On 11/01/2013 03:03 PM, Robert Coli wrote:

On Fri, Nov 1, 2013 at 9:36 AM, Narendra Sharma <narendra.sha...@gmail.com> wrote:

I was successfully able to bootstrap the node. The issue was RF 2. Thanks again Robert.

For the record, I'm not entirely clear why bootstrapping two nodes into the same range should have caused your specific bootstrap problem, but I am glad to hear that bootstrapping one node at a time was a usable workaround.

=Rob

(A) If it can't work, shouldn't a node refuse to bootstrap if it sees another node already in that state? (B) It would be nice if nodes in independent DCs could at least be bootstrapped at the same time.
Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled
On 11/06/2013 11:18 PM, Aaron Morton wrote:

The default row cache is on the JVM heap; have you changed to the ConcurrentLinkedHashCacheProvider?

ConcurrentLinkedHashCacheProvider was removed in 2.0.x.
Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled
Both caches involve several objects per entry (What do we want? Packed objects. When do we want them? Now!). The size is an estimate of the off-heap values only, not the total size nor the number of entries. An acceptable size will depend on your data and access patterns. In one case we had a cluster that at 512mb would go into a GC death spiral despite plenty of free heap (presumably just due to the number of objects), while empirically the cluster runs smoothly at 384mb. Your caches appear on the larger side; I suggest trying smaller values and only increasing when it produces measurable, sustained gains.

On 11/05/2013 04:04 AM, Jiri Horky wrote:

Hi there,

we are seeing extensive memory allocation leading to quite long and frequent GC pauses when using the row cache. This is on a cassandra 2.0.0 cluster with the JNA 4.0 library and the following settings:

key_cache_size_in_mb: 300
key_cache_save_period: 14400
row_cache_size_in_mb: 1024
row_cache_save_period: 14400
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
commitlog_segment_size_in_mb: 32

-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G -Xmn1024M -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data2/cassandra-work/instance-1/cassandra-1383566283-pid1893.hprof -Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark

We have disabled the row cache on one node to see the difference. Please see the attached plots from VisualVM; I think the effect is quite visible. I have also taken 10x "jmap -histo" at 5s intervals on an affected server and plotted the result, attached as well. I have taken a dump of the application when the heap size was 10GB; most of the memory was unreachable, which was expected. The majority was used by 55-59M objects of the HeapByteBuffer, byte[] and org.apache.cassandra.db.Column classes.
I also include a list of inbound references to the HeapByteBuffer objects, from which it should be visible where they are being allocated. This was acquired using Eclipse MAT.

Here is the comparison of GC times with the row cache enabled and disabled:

prg01 - row cache enabled
 - uptime 20h45m
 - ConcurrentMarkSweep - 11494686 ms
 - ParNew - 14690885 ms
 - time spent in GC: 35%

prg02 - row cache disabled
 - uptime 23h45m
 - ConcurrentMarkSweep - 251 ms
 - ParNew - 230791 ms
 - time spent in GC: 0.27%

I would be grateful for any hints. Please let me know if you need any further information. For now, we are going to disable the row cache.

Regards,
Jiri Horky
Re: The performance difference of online bulk insertion and the file-based bulk loading
On 10/15/2013 08:41 AM, José Elias Queiroga da Costa Araújo wrote:

Is there a way that we can warm up the cache after the file-based bulk loading, so that the data is cached in memory first, and then, when we issue the bulk retrieval, the performance can be closer to what is provided by the online bulk insertion?

Somewhat hacky, but you can at least warm the OS page cache with `cat FILES > /dev/null`.
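The `cat` trick works because any sequential read pulls the file into the OS page cache as a side effect. A small Python sketch of the same idea (the function name and the glob pattern in the comment are illustrative, not a real tool):

```python
def warm_page_cache(paths, block_size=1 << 20):
    """Read each file sequentially and throw the bytes away; the side
    effect is that the OS page cache now holds the file contents,
    just like `cat FILES > /dev/null`. Returns total bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(block_size)
                if not block:
                    break
                total += len(block)
    return total

# e.g., for a hypothetical sstable layout:
# import glob
# warm_page_cache(glob.glob("/var/lib/cassandra/data/ks/cf/*-Data.db"))
```

Note this only warms the OS page cache, not Cassandra's own key/row caches, so it narrows but does not eliminate the gap versus online insertion.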
Re: nodetool status reporting dead node as UN
When debugging gossip-related problems (is this node really down/dead/in-some-weird-state) you might have better luck looking at `nodetool gossipinfo`. The "UN even though everything is bad" thing might be https://issues.apache.org/jira/browse/CASSANDRA-5913, though I'm not sure exactly what happened in your case. I'm also confused why an IP changed on restart.

On 10/17/2013 06:12 PM, Philip Persad wrote:

Hello,

I seem to have gotten my cluster into a bit of a strange state. Pardon the rather verbose email, but there is a fair amount of background. I'm running a 3 node Cassandra 2.0.1 cluster. This particular cluster is used only rather intermittently for dev/testing and does not see particularly heavy use; it's mostly a catch-all cluster for environments which don't have a dedicated cluster to themselves.

I noticed today that one of the nodes had died, because nodetool repair was failing due to a down replica. I ran nodetool status and, sure enough, one of my nodes showed up as down. When I looked on the actual box, the cassandra process was up and running and everything in the logs looked sensible. The most controversial thing I saw was 1 CMS garbage collection per hour, each taking ~250 ms. Nonetheless, the node was not responding, so I restarted it.

So far so good: everything is starting up, and my ~30 column families across ~6 keyspaces are all initializing. The node then handshakes with my other two nodes and reports them both as up. Here is where things get strange. According to the logs on the other two nodes, the third node has come back up and all is well.
However in the third node, I see a wall of the following in the logs (IP addresses masked):

INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:25,655 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:26,657 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:27,660 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
INFO [HANDSHAKE-/10.21.5.221] 2013-10-17 20:22:28,254 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.221
INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java (line 789) InetAddress /x.x.x.221 is now UP
INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:28,661 OutboundTcpConnection.java (line 386) Handshaking version with /x.x.x.222
INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java (line 789) InetAddress /x.x.x.222 is now UP
INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806) InetAddress /x.x.x.222 is now DOWN
INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806) InetAddress /x.x.x.221 is now DOWN

Additionally, client requests to the cluster at consistency QUORUM start failing (saying 2 responses were required but only 1 replica responded). According to nodetool status, all the nodes are up. This is clearly not good. I take down the problem node. Nodetool reports it down and QUORUM client reads/writes start working again. In an attempt to get the cluster back into a good state, I delete all the data on the problem node and then bring it back up. The other two nodes log a changed host ID for the IP of the node I wiped and then handshake with it. The problem node also comes up, but reads/writes start failing again with the same error. I decide to take the problem node down again. However this time, even after the process is dead, nodetool and the other two nodes report that my third node is still up and requests to the cluster continue to fail. Running nodetool status against either of the live nodes shows that all nodes are up. Running nodetool status against
Re: Huge multi-data center latencies
On 10/21/2013 07:03 PM, Hobin Yoon wrote: Another question is how do you get the local DC name? Have a look at org.apache.cassandra.db.EndpointSnitchInfo.getDatacenter
Re: How to use Cassandra on-node storage engine only?
As far as I know this has not been done before. I would be interested in hearing how it turns out. On 10/23/2013 09:47 AM, Yasin Celik wrote: I am developing an application for data storage. All the replication, routing and data-retrieval concerns are handled in my application. Up to now, the data is stored in memory. Now, I want to use the Cassandra storage engine to flush data from memory onto the hard drive. I am not sure if that is a correct approach. My question: Can I use the Cassandra data storage engine only? I do not want to use Cassandra as a whole standalone product (in that case, I would run one independent Cassandra per node and my application would act as a client of Cassandra. This would put a lot of burden on the node since it adds unnecessary levels between my application and the storage engine). I have my own replication, ring and routing code. I only need the on-node storage facilities of Cassandra. I want to embed Cassandra in my application as a library.
vnode + multi dc migration
I know there is a good deal of interest [1] in feasible methods for enabling vnodes on clusters that did not start with them. We recently completed a migration from a production cluster not using vnodes and in a single DC to one using vnodes in two DCs. We used the "just spin up a new DC and rebuild" strategy instead of shuffle, and it worked. The checklist was long, but it really wasn't more complicated than that. Thanks to several people in #cassandra for suggesting the technique and reviewing procedures. One oddity we noticed is that when nodes in the new DC joined (auto_bootstrap: false), CL.ONE performance tanked [2]. The spike is when the nodes came online, and the drop is when reads were switched to CL.LOCAL_QUORUM. This only happened when the new DC was cross-continent (not a logical DC in the same colo). [1] http://mail-archives.apache.org/mod_mbox/cassandra-user/201308.mbox/%3CCAEDUwd12vhRJbPZpVJ6QzTOx3pwU=11hhgkkipghhgvosbj...@mail.gmail.com%3E [2] http://i.imgur.com/ZW5Ob8V.png
Re: Multi-dc restart impact
Thanks, double checked; reads are CL.ONE. On 10/10/2013 11:15 AM, J. Ryan Earl wrote: Are you doing QUORUM reads instead of LOCAL_QUORUM reads? On Wed, Oct 9, 2013 at 7:41 PM, Chris Burroughs chris.burrou...@gmail.com wrote: I have not been able to do the test with the 2nd cluster, but have been given a disturbing data point. We had a disk slowly fail, causing a significant performance degradation that was only resolved when the sick node was killed. * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1 * perf in other DC: http://i.imgur.com/gEMrLyF.png?1 Not only was a single slow node able to cause an order of magnitude performance hit in a dc, but the other dc fared *worse*. On 09/18/2013 08:50 AM, Chris Burroughs wrote: On 09/17/2013 04:44 PM, Robert Coli wrote: On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs chris.burrou...@gmail.com wrote: We have a 2 DC cluster running cassandra 1.2.9. They are in actual physically separate DCs on opposite coasts of the US, not just logical ones. The primary use of this cluster is CL.ONE reads out of a single column family. My expectation was that in such a scenario restarts would have minimal impact in the DC where the restart occurred, and no impact in the remote DC. We are seeing instead that restarts in one DC have a dramatic impact on performance in the other (let's call them DCs A and B). Did you end up filing a JIRA on this, or some other outcome? =Rob No. I am currently in the process of taking a 2nd cluster from being single to dual DC. Once that is done I was going to repeat the test with each cluster and gather as much information as reasonable.
Re: Multi-dc restart impact
I have not been able to do the test with the 2nd cluster, but have been given a disturbing data point. We had a disk slowly fail, causing a significant performance degradation that was only resolved when the sick node was killed. * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1 * perf in other DC: http://i.imgur.com/gEMrLyF.png?1 Not only was a single slow node able to cause an order of magnitude performance hit in a dc, but the other dc fared *worse*. On 09/18/2013 08:50 AM, Chris Burroughs wrote: On 09/17/2013 04:44 PM, Robert Coli wrote: On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs chris.burrou...@gmail.com wrote: We have a 2 DC cluster running cassandra 1.2.9. They are in actual physically separate DCs on opposite coasts of the US, not just logical ones. The primary use of this cluster is CL.ONE reads out of a single column family. My expectation was that in such a scenario restarts would have minimal impact in the DC where the restart occurred, and no impact in the remote DC. We are seeing instead that restarts in one DC have a dramatic impact on performance in the other (let's call them DCs A and B). Did you end up filing a JIRA on this, or some other outcome? =Rob No. I am currently in the process of taking a 2nd cluster from being single to dual DC. Once that is done I was going to repeat the test with each cluster and gather as much information as reasonable.
gossip settling and bootstrap problems
I've been running into a variety of tricky-to-diagnose problems recently that could be summarized as "bootstrap-related tasks fail without extra hacky sleep time". This is a sample, edited log file for bootstrapping a node that captures the general dynamics: http://pastebin.com/yeN9USLt This build has been modified (from 1.2.10) to sleep 4*RING_DELAY in StorageService.bootstrap(). A few notes:
 * At 30s nodes are still flapping UP and DOWN
 * handshaking is still going strong at 90s
 * Things do stabilize; they don't flap indefinitely
 * Bootstrap succeeds once it starts.

In this particular cluster a default RING_DELAY/build (30s) fails every time. Ping times, TCP retransmits, and other general network stuff look fine. There are several different tickets (some from me) that reference what seemed to me to be possibly similar or at least correlated issues:
 * CASSANDRA-4288: prevent thrift server from starting before gossip has settled
 * CASSANDRA-5815: NPE from migration manager
 * CASSANDRA-5915: node flapping prevents replace_node from succeeding consistently
 * CASSANDRA-6156: Poor resilience and recovery for bootstrapping node - unable to fetch range
 * CASSANDRA-6127: vnodes don't scale to hundreds of nodes

I suspect that a combination of factors is causing gossip to take longer to stabilize:
 * vnodes
 * (cross-country or greater) multi-dc
 * bigger than a test cluster (> 50 nodes)
 * reconnecting snitch

What are other people seeing in their clusters? Does anyone routinely change RING_DELAY (google finds precious few references)?
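The "wait for gossip to settle" idea behind CASSANDRA-4288 and the RING_DELAY sleeps boils down to polling cluster state until it stops changing for some quiet period, rather than sleeping a fixed amount. A generic, hypothetical sketch of that pattern (the function and parameter names are ours; in practice `poll` would wrap something like parsed `nodetool gossipinfo` output):

```python
import time

def wait_until_settled(poll, interval=1.0, quiet_period=5.0, timeout=120.0):
    """Poll a state-producing function until its result has been stable
    for `quiet_period` seconds; give up after `timeout` seconds.
    Returns True if the state settled, False on timeout."""
    deadline = time.monotonic() + timeout
    last = poll()
    last_change = time.monotonic()
    while time.monotonic() < deadline:
        time.sleep(interval)
        current = poll()
        now = time.monotonic()
        if current != last:
            # state changed; restart the quiet-period clock
            last, last_change = current, now
        elif now - last_change >= quiet_period:
            return True
    return False
```

The advantage over a fixed 4*RING_DELAY sleep is that a fast-converging cluster proceeds quickly while a slow one gets as long as it needs (up to the timeout).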
Re: Nodes separating from the ring
I have observed one problem with an inconsistent ring that is superficially similar (node thinks it's up but peers disagree) and noted details in CASSANDRA-6082. However, it does not sound like the details of either the symptoms or the resolution match what you describe. If you have not already, running `nodetool gossipinfo` might give you more clues than `status`. On 09/13/2013 10:48 AM, Dave Cowen wrote: Hi, all - We've been running Cassandra 1.1.12 in production since February, and have experienced a vexing problem with an arbitrary node falling out of or separating from the ring on occasion. When a node falls out of the ring, running nodetool ring on the misbehaving node shows that the misbehaving node believes that it is Up, but that the rest of the ring is Down, and the rest of the ring has question marks listed for load. nodetool ring on any of the other nodes, however, shows the misbehaving node as Down but everything else is up. Shutting down and restarting the misbehaving node does not result in changed behavior. We can only get the misbehaving node to rejoin the ring by shutting it down, running nodetool removetoken <misbehaving node token> and nodetool removetoken force elsewhere in the ring. After the node's token has been removed from the ring, it will rejoin and behave normally when it is restarted. This is not a frequent occurrence - we can go months between this happening. It most commonly occurs when a different node is brought down and then back up, but it can happen spontaneously. This is also not associated with a network connectivity event; we've seen no interruption in the nodes being able to communicate over the network. As above, it's also not isolated to a single node; we've seen this behavior on multiple nodes. This has occurred with both the identical seeds specified in cassandra.yaml on each node, and also when we remove the node from its own seed list (so any seed won't try to auto-bootstrap from itself).
Seeds have always been up and available. Has anyone else seen similar behavior? For obvious reasons, we hate seeing one of the nodes suddenly fall out and require intervention when we flap another node, or for no reason at all. Thanks, Dave
Re: Multi-dc restart impact
On 09/17/2013 04:44 PM, Robert Coli wrote: On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs chris.burrou...@gmail.com wrote: We have a 2 DC cluster running cassandra 1.2.9. They are in actual physically separate DCs on opposite coasts of the US, not just logical ones. The primary use of this cluster is CL.ONE reads out of a single column family. My expectation was that in such a scenario restarts would have minimal impact in the DC where the restart occurred, and no impact in the remote DC. We are seeing instead that restarts in one DC have a dramatic impact on performance in the other (let's call them DCs A and B). Did you end up filing a JIRA on this, or some other outcome? =Rob No. I am currently in the process of taking a 2nd cluster from being single to dual DC. Once that is done I was going to repeat the test with each cluster and gather as much information as reasonable.
Re: I don't understand shuffle progress
On 09/17/2013 09:41 PM, Paulo Motta wrote: So you're saying the only feasible way of enabling VNodes on an upgraded C* 1.2 is by doing fork writes to a brand new cluster + bulk load of sstables from the old cluster? Or is it possible to succeed on shuffling, even if that means waiting some weeks for the shuffle to complete? In a multi DC cluster situation you *should* be able to bring up a new DC with vnodes, bootstrap it, and then decommission the old cluster.
Re: I don't understand shuffle progress
http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_add_dc_to_cluster_t.html This is a basic outline. On 09/18/2013 10:32 AM, Juan Manuel Formoso wrote: I really like this idea. I can create a new cluster and have it replicate the old one, after it finishes I can remove the original. Any good resource that explains how to add a new datacenter to a live single-dc cluster that anybody can recommend? On Wed, Sep 18, 2013 at 9:58 AM, Chris Burroughs chris.burrou...@gmail.com wrote: On 09/17/2013 09:41 PM, Paulo Motta wrote: So you're saying the only feasible way of enabling VNodes on an upgraded C* 1.2 is by doing fork writes to a brand new cluster + bulk load of sstables from the old cluster? Or is it possible to succeed on shuffling, even if that means waiting some weeks for the shuffle to complete? In a multi DC cluster situation you *should* be able to bring up a new DC with vnodes, bootstrap it, and then decommission the old cluster.
Multi-dc restart impact
We have a 2 DC cluster running cassandra 1.2.9. They are in actual physically separate DCs on opposite coasts of the US, not just logical ones. The primary use of this cluster is CL.ONE reads out of a single column family. My expectation was that in such a scenario restarts would have minimal impact in the DC where the restart occurred, and no impact in the remote DC. We are seeing instead that restarts in one DC have a dramatic impact on performance in the other (let's call them DCs A and B).

Test scenario on a node in DC A:
 * disablegossip: no change
 * drain: no change
 * stop node: no change
 * start node again: large increase in latency in both DCs A *and* B

This is a graph showing the increase in latency (org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile) from DC *B*: http://i.imgur.com/OkIQyXI.png (Actual clients report similar numbers that agree with this server-side measurement.) Latency jumps by over an order of magnitude and out of SLAs. (I would prefer restarting to not cause a latency spike in either DC, but the one induced in the remote DC is particularly concerning.) However, the node that was restarted reports only a minor increase in latency: http://i.imgur.com/KnGEJrE.png

This is confusing from several different angles:
 * I would not expect any cross-dc reads to normally be occurring
 * If there were cross-DC reads, they would take 50+ ms instead of the 5 ms normally reported
 * If the node that was restarted was still somehow involved in reads, its reporting shows it can only account for a small amount of the latency increase.

Some possibly relevant configuration:
 * GossipingPropertyFileSnitch
 * dynamic_snitch_update_interval_in_ms: 100
 * dynamic_snitch_reset_interval_in_ms: 60
 * dynamic_snitch_badness_threshold: 0.1
 * read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (same type of behavior was observed with just read_repair_chance=0.1)

Has anyone else observed similar behavior and found a way to limit it?
This seems like something that ought not to happen but without knowing why it is occurring I'm not sure how to stop it.
Re: row cache
On 09/01/2013 03:06 PM, Faraaz Sareshwala wrote: Yes, that is correct. The SerializingCacheProvider stores row cache contents off heap. I believe you need JNA enabled for this though. Someone please correct me if I am wrong here. The ConcurrentLinkedHashCacheProvider stores row cache contents on the java heap itself. Naming things is hard. Both caches are in memory and are backed by a ConcurrentLinkedHashMap. In the case of the SerializingCacheProvider the *values* are stored in off-heap buffers. Both must store a half dozen or so objects (on heap) per entry (org.apache.cassandra.cache.RowCacheKey, com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue, java.util.concurrent.ConcurrentHashMap$HashEntry, etc). It would probably be better to call this a mixed-heap rather than off-heap cache. You may find the number of entries you can hold without gc problems to be surprisingly low (relative to, say, memcached, or physical memory on modern hardware). Invalidating a column with SerializingCacheProvider invalidates the entire row, while with ConcurrentLinkedHashCacheProvider it does not. SerializingCacheProvider does not require JNA. Both also use memory estimation of the size (of the values only) to determine the total number of entries retained. Estimating the size of the totally on-heap ConcurrentLinkedHashCacheProvider has historically been dicey since we switched from sizing in entries, and it has been removed in 2.0.0. As said elsewhere in this thread, the utility of the row cache varies from absolutely essential to a source of numerous problems depending on the specifics of the data model and request distribution.
multi-dc clusters with 'local' ips and no vpn
Cassandra makes the totally reasonable assumption that the entire cluster is in one routable address space. We unfortunately had a situation where:
 * nodes can talk to each other in the same dc on an internal address, but not talk to each other over their external 1:1 NAT address.
 * nodes can talk to nodes in the other dc over the external address, but there is no usable shared internal address space they can talk over

In case anyone else finds themselves in the same situation, we have what we think is a working solution in pre-production. CASSANDRA-5630 handles the reconnect trick to prefer the local ip when in the same DC, and some iptables rules allow the local nodes to do the initial gossiping with each other before that switch:

for each node in the same dc:
    'iptables -t nat -A OUTPUT -j DNAT -p tcp --dst %s --dport 7000 -o eth0 --to-destination %s' % (ext_ip, local_ip)
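The rule generation above can be written out as a small script. This is a sketch of the same idea, not the exact tooling used; the IPs, the `eth0` interface, and the function name are placeholders, and `dry_run=True` only builds the commands rather than applying them (applying requires root):

```python
import subprocess

def dnat_gossip_rules(peers, dry_run=True):
    """Build (and optionally apply) iptables DNAT rules so that gossip
    traffic (port 7000) sent to a same-DC peer's external NAT address is
    rewritten to its internal address.  `peers` maps external IP ->
    internal IP; addresses and interface are placeholders."""
    commands = []
    for ext_ip, local_ip in peers.items():
        commands.append([
            'iptables', '-t', 'nat', '-A', 'OUTPUT', '-j', 'DNAT',
            '-p', 'tcp', '--dst', ext_ip, '--dport', '7000',
            '-o', 'eth0', '--to-destination', local_ip,
        ])
    if not dry_run:
        for cmd in commands:
            subprocess.check_call(cmd)  # needs root; raises on failure
    return commands
```

Because the rules live in the nat table's OUTPUT chain, they only affect locally generated traffic, which is exactly the initial outbound gossip connections that need redirecting.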
SurgeCon 2012
Surge [1] is a scalability-focused conference in late September hosted in Baltimore. It's a pretty cool conference with a good mix of operationally minded people interested in scalability, distributed systems, systems-level performance, and good stuff like that. You should go! [2] For those of you who like historical trivia, Mike Malone gave a well-received Cassandra talk at the first SurgeCon in 2010 [3]. This year there is organized room for BoFs and such, with several one-hour slots on Wednesday and Thursday evenings between 9 p.m. and midnight. Last year a few of us got together informally around lunch time [4]. Interested in getting together again this year? Think we have critical mass for a BoF? [1] http://omniti.com/surge/2012 [2] http://omniti.com/surge/2012/register [3] http://omniti.com/surge/2010/speakers/mike-malone [4] http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%3c4e82140a.5070...@gmail.com%3E
Re: Distinct Counter Proposal for Cassandra
On 06/13/2012 01:00 PM, Yuki Morishita wrote: The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work. Other alternative is self-learning bitmap (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values. The closest we could get to one-size-fits-all would probably be an adaptive counting scheme that uses linear counting (or a self-learning bitmap, didn't know about that one!) for small expected cardinalities and a LogLog variant for higher ones. It's more choices to make, but choosing between "not too big" and "really really big" doesn't seem like an unreasonable burden to me.
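For reference, linear counting itself is only a few lines: hash each item into an m-bit bitmap and estimate the distinct count from the fraction of bits still zero. This is a minimal illustrative sketch (not Cassandra or stream-lib code; the function name and defaults are our own):

```python
import hashlib
import math

def linear_count(items, m=4096):
    """Linear counting: set one bit per hashed item in an m-bit bitmap,
    then estimate cardinality as m * ln(m / zero_bits)."""
    bitmap = bytearray(m // 8)
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], 'big') % m
        bitmap[h // 8] |= 1 << (h % 8)
    # count bits still zero across the bitmap
    zero_bits = sum(bin(byte ^ 0xFF).count('1') for byte in bitmap)
    if zero_bits == 0:
        return float(m)  # bitmap saturated; estimate unreliable past here
    return m * math.log(m / zero_bits)
```

The estimate degrades as the bitmap saturates, which is exactly why an adaptive scheme would switch to a LogLog variant once the expected cardinality approaches m.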
Re: Distinct Counter Proposal for Cassandra
Well I obviously think it would be handy. If this gets proposed and ends up using stream-lib, don't be shy about asking for help. On a more general note, it would be great to see the special-case Counter code become more general atomic operation code. On 06/13/2012 01:15 PM, Utku Can Topçu wrote: Hi Yuki, I think I should have used the word discussion instead of proposal for the mailing subject. I have quite some of a design in my mind but I think it's not yet ripe enough to formalize. I'll try to simplify it and open a Jira ticket. But first I'm wondering if there would be any excitement in the community for such a feature. Regards, Utku On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita mor.y...@gmail.com wrote: You can open JIRA ticket at https://issues.apache.org/jira/browse/CASSANDRA with your proposal. Just for the input: I had once implemented HyperLogLog counter to use internally in Cassandra, but it turned out I didn't need it so I just put it to gist. You can find it here: https://gist.github.com/2597943 The above implementation and most of the other ones (including stream-lib) implement the optimized version of the algorithm which counts up to 10^9, so may need some work. Other alternative is self-learning bitmap ( http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my understanding, is more memory efficient when counting small values. Yuki On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote: Hi All, Let's assume we have a use case where we need to count the number of columns for a given key. Let's say the key is the URL and the column-name is the IP address or any cardinality identifier. The straightforward implementation seems to be simple: just inserting the IP addresses as columns under the key defined by the URL and using get_count to count them back. However the problem here is in case of large rows (where too many IP addresses are in); the get_count method has to de-serialize the whole row and calculate the count.
As also defined in the user guides, it's not an O(1) operation and it's quite costly. However, this problem seems to have better solutions if you don't have a strict requirement for the count to be exact. There are streaming algorithms that will provide good cardinality estimations within a predefined error rate; the most popular one seems to be the (Hyper)LogLog algorithm, and there's also an optimal one developed recently, please check http://dl.acm.org/citation.cfm?doid=1807085.1807094 If you want to take a look at the Java implementation for LogLog, Clearspring has both LogLog and space-optimized HyperLogLog available at https://github.com/clearspring/stream-lib I don't see a reason why this can't be implemented in Cassandra. The distributed nature of all these algorithms can easily be adapted to Cassandra's model. I think most of us would love to see some cardinality-estimating columns in Cassandra. Regards, Utku
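As an illustration of why such a counter is cheap enough to embed, here is a minimal pure-Python HyperLogLog with the standard small-range (linear counting) correction. This is a sketch of the algorithm itself, not the stream-lib or gist implementations mentioned above; names and defaults are our own:

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog: m = 2^b registers, each holding the maximum
    leading-zero rank seen among items hashed into that register."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        # standard alpha_m approximation, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # 64-bit hash: low b bits pick a register, the rest feed the rank
        x = int.from_bytes(hashlib.md5(item.encode()).digest()[:8], 'big')
        j = x & (self.m - 1)
        w = x >> self.b
        rank = (64 - self.b) - w.bit_length() + 1  # 1-based leading-zero count
        if rank > self.registers[j]:
            self.registers[j] = rank

    def estimate(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros > 0:
            # small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

With b=10 the whole sketch is 1024 small registers regardless of how many items are added, and merging two sketches is just a register-wise max, which is what makes the algorithm fit a distributed model so naturally.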
Re: Row caching in Cassandra 1.1 by column family
Check out the rows_cached CF attribute. On 06/18/2012 06:01 PM, Oleg Dulin wrote: Dear distinguished colleagues: I don't want all of my CFs cached, but one in particular I do. How can I configure that ? Thanks, Oleg
Re: 1.0.3 CLI oddities
Sounds like https://issues.apache.org/jira/browse/CASSANDRA-3558 and the other tickets referenced there. On 11/28/2011 05:05 AM, Janne Jalkanen wrote: Hi! (Asked this on IRC too, but didn't get anyone to respond, so here goes...) Is it just me, or are these real bugs? On 1.0.3, from CLI: update column family XXX with gc_grace = 36000; just says null with nothing logged. Previous value is the default. Also, on 1.0.3, update column family XXX with compression_options={sstable_compression:SnappyCompressor,chunk_length_kb:64}; returns Internal error processing system_update_column_family and log says Invalid negative or null chunk_length_kb (stack trace below) Setting the compression options worked on 1.0.0 when I was testing (though my 64 kB became 64 MB, but I believe this was fixed in 1.0.3.) Did the syntax change between 1.0.0 and 1.0.3? Or am I doing something wrong? The database was upgraded from 0.6.13 to 1.0.0, then scrubbed, then compression options set to some CFs, then upgraded to 1.0.3 and trying to set compression on other CFs.
Stack trace:

ERROR [pool-2-thread-68] 2011-11-28 09:59:26,434 Cassandra.java (line 4038) Internal error processing system_update_column_family
java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.io.IOException: org.apache.cassandra.config.ConfigurationException: Invalid negative or null chunk_length_kb
    at org.apache.cassandra.thrift.CassandraServer.applyMigrationOnStage(CassandraServer.java:898)
    at org.apache.cassandra.thrift.CassandraServer.system_update_column_family(CassandraServer.java:1089)
    at org.apache.cassandra.thrift.Cassandra$Processor$system_update_column_family.process(Cassandra.java:4032)
    at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
Caused by: java.util.concurrent.ExecutionException: java.io.IOException: org.apache.cassandra.config.ConfigurationException: Invalid negative or null chunk_length_kb
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.cassandra.thrift.CassandraServer.applyMigrationOnStage(CassandraServer.java:890)
    ... 7 more
Caused by: java.io.IOException: org.apache.cassandra.config.ConfigurationException: Invalid negative or null chunk_length_kb
    at org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:78)
    at org.apache.cassandra.db.migration.Migration.apply(Migration.java:156)
    at org.apache.cassandra.thrift.CassandraServer$2.call(CassandraServer.java:883)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    ... 3 more
Caused by: org.apache.cassandra.config.ConfigurationException: Invalid negative or null chunk_length_kb
    at org.apache.cassandra.io.compress.CompressionParameters.validateChunkLength(CompressionParameters.java:167)
    at org.apache.cassandra.io.compress.CompressionParameters.create(CompressionParameters.java:52)
    at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:796)
    at org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:74)
    ... 7 more
ERROR [MigrationStage:1] 2011-11-28 09:59:26,434 AbstractCassandraDaemon.java (line 133) Fatal exception in thread Thread[MigrationStage:1,5,main]
java.io.IOException: org.apache.cassandra.config.ConfigurationException: Invalid negative or null chunk_length_kb
    at org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:78)
    at org.apache.cassandra.db.migration.Migration.apply(Migration.java:156)
    at org.apache.cassandra.thrift.CassandraServer$2.call(CassandraServer.java:883)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:680)
Caused by: org.apache.cassandra.config.ConfigurationException: Invalid negative or null chunk_length_kb
    at org.apache.cassandra.io.compress.CompressionParameters.validateChunkLength(CompressionParameters.java:167)
    at org.apache.cassandra.io.compress.CompressionParameters.create(CompressionParameters.java:52)
    at
Re: Second Cassandra users survey
- It would be super cool if all of that counter work made it possible to support other atomic data types (sets? CAS? just pass an associative/commutative Function to apply).
- Again with types: pluggable, type-specific compression.
- Wishy-washy wish: simpler elasticity. I would like to go from 6--8--7 nodes without each of those being an annoying fight with tokens.
- Gossip as a library. Gossip/failure detection is something C* seems to have gotten particularly right (or at least it's something that has not needed to change much). It would be cool to use Cassandra's gossip protocol as a distributed-systems building tool a la ZooKeeper.

On 11/01/2011 06:59 PM, Jonathan Ellis wrote: Hi all, Two years ago I asked for Cassandra use cases and feature requests. [1] The results [2] have been extremely useful in setting and prioritizing goals for Cassandra development. But with the release of 1.0 we've accomplished basically everything from our original wish list. [3] I'd love to hear from modern Cassandra users again, especially if you're usually a quiet lurker. What does Cassandra do well? What are your pain points? What's your feature wish list? As before, if you're in stealth mode or don't want to say anything in public, feel free to reply to me privately and I will keep it off the record. [1] http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01148.html [2] http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg01446.html [3] http://www.mail-archive.com/dev@cassandra.apache.org/msg01524.html
Re: CMS GC initial-mark taking 6 seconds , bad?
On 10/20/2011 09:38 AM, Maxim Potekhin wrote: I happen to have 48GB on each machines I use in the cluster. Can I assume that I can't really use all of this memory productively? Do you have any suggestion related to that? Can I run more than one instance on Cassandra on the same box (using different ports) to take advantage of this memory, assuming the disk has enough bandwidth? You are likely to not have good luck with a JVM heap that large. But you can: - Leave all that memory to the OS page cache. - mmap index files - use an off heap cache All of those are productive uses.
Re: ApacheCon meetup?
On 10/11/2011 12:05 PM, Eric Evans wrote: Let's do it. We can organize an official one, and still grab food together if that's not enough. :) Great! Thanks for putting this together.
ApacheCon meetup?
ApacheCon NA is coming up next month. I suspect there will be at least a few Cassandra users there (yeah new release!). Would anyone be interested in getting together and sharing some stories? This could either be a official [1] meetup. Or grabbing food together sometime. [1] http://wiki.apache.org/apachecon/ApacheMeetupsNa11
Re: Surgecon Meetup?
So it sounds like there are about a half dozen of us, some coming Wednesday, others Thursday. I'll have some Cassandra eye logos out around lunch both of those days. If that herds us together then success! If not I'll try something more formal. Looking forward to meeting everyone. On 09/25/2011 07:27 PM, Chris Burroughs wrote: Surge [1] is a scalability-focused conference in late September hosted in Baltimore. It's a pretty cool conference with a good mix of operationally minded people interested in scalability, distributed systems, systems-level performance, and good stuff like that. You should go! [2] Anyway, I'll be there, and if any other Cassandra users are coming I'm happy to help herd us towards meeting up, lunch, hacking, etc. I *think* there might be some time for structured BoF type sessions as well. [1] http://omniti.com/surge/2011 [2] Actually tickets recently sold out, you should go in 2012!
Surgecon Meetup?
Surge [1] is a scalability-focused conference in late September hosted in Baltimore. It's a pretty cool conference with a good mix of operationally minded people interested in scalability, distributed systems, systems-level performance, and good stuff like that. You should go! [2] Anyway, I'll be there, and if any other Cassandra users are coming I'm happy to help herd us towards meeting up, lunch, hacking, etc. I *think* there might be some time for structured BoF type sessions as well. [1] http://omniti.com/surge/2011 [2] Actually tickets recently sold out, you should go in 2012!
Re: cassandra server disk full
On 07/25/2011 01:53 PM, Ryan King wrote: Actually I was wrong -- our patch will disable gossip and thrift but leave the process running: https://issues.apache.org/jira/browse/CASSANDRA-2118 If people are interested in that I can make sure it's up to date with our latest version. Thanks Ryan. /me expresses interest. Zombie nodes when the file system does something interesting are not fun.
Re: Survey: Cassandra/JVM Resident Set Size increase
Thanks to everyone who responded (I think I learned a few new tricks from seeing what you tried and how you monitor). I didn't see any patterns in JVM, OS, cassandra versions etc. At this time I'm confident in saying CASSANDRA-2868 (and thus really http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7066129) is the culprit. On 07/12/2011 09:28 AM, Chris Burroughs wrote: ### Preamble There have been several reports on the mailing list of the JVM running Cassandra using too much memory. That is, the resident set size exceeds (max java heap size + mmapped segments) and continues to grow until the process swaps, the kernel oom killer comes along, or performance just degrades too far due to the lack of space for the page cache. It has been unclear from these reports if there is a pattern. My hope here is that by comparing JVM versions, OS versions, JVM configuration etc., we will find something. Thank you everyone for your time. Some example reports: - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html - https://issues.apache.org/jira/browse/CASSANDRA-2868 - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html For reference theories include (in no particular order): - memory fragmentation - JVM bug - OS/glibc bug - direct memory - swap induced fragmentation - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity. ### Survey 1. Do you think you are experiencing this problem? 2. Why? (This is a good time to share a graph like http://www.twitpic.com/5fdabn or http://img24.imageshack.us/img24/1754/cassandrarss.png) 3. Are you using mmap?
(If yes be sure to have read http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have used pmap [or another tool] to rule out mmap and top deceiving you.) 4. Are you using JNA? Was mlockall successful (it's in the logs on startup)? 5. Is swap enabled? Are you swapping? 6. What version of Apache Cassandra are you using? 7. What is the earliest version of Apache Cassandra you recall seeing this problem with? 8. Have you tried the patch from CASSANDRA-2654? 9. What jvm and version are you using? 10. What OS and version are you using? 11. What are your jvm flags? 12. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)? 13. Can you characterise how much GC your cluster is doing? 14. Approximately how many reads/writes per unit time is your cluster doing (per node or the whole cluster)? 15. How are your column families configured (key cache size, row cache size, etc.)?
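For the mmap question, a minimal sketch of using /proc/&lt;pid&gt;/smaps (Linux only) to separate file-backed (mmapped) resident memory from anonymous memory, so mmapped SSTables can be ruled in or out; the helper below is hypothetical, not something Cassandra ships:

```python
import re

# Header lines of /proc/<pid>/smaps look like:
#   7f1c2000-7f1c3000 r--p 00000000 08:01 1234 /var/lib/cassandra/...-Index.db
# and are followed by per-mapping stats such as "Rss:  4 kB".
HEADER = re.compile(r'^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$')

def split_rss_by_backing(smaps_text):
    """Sum Rss (kB) separately for file-backed and anonymous mappings."""
    file_kb = anon_kb = 0
    current_is_file = False
    for line in smaps_text.splitlines():
        m = HEADER.match(line)
        if m:
            # a header with a trailing path is a file-backed mapping
            current_is_file = bool(m.group(1).strip())
        elif line.startswith('Rss:'):
            kb = int(line.split()[1])
            if current_is_file:
                file_kb += kb
            else:
                anon_kb += kb
    return {'file_kb': file_kb, 'anon_kb': anon_kb}
```

Run it over the smaps file of the Cassandra pid; if `anon_kb` is what keeps growing, mmap and top are not deceiving you and the leak is in anonymous (heap/native) memory.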
Re: JNA to avoid swap but physical memory increase
On 07/15/2011 07:24 AM, Daniel Doubleday wrote: Also our experience shows that the jna call does not prevent swapping so the general advice is to disable swap. Can you confirm you don't get the (paraphrasing) "whoops, we tried mlockall but ulimits denied us" message on startup?
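Beyond grepping the startup logs, the ulimit side of this can be checked directly; a small sketch using Python's `resource` module (which mirrors `ulimit -l`), since mlockall for a non-root process hinges on RLIMIT_MEMLOCK:

```python
import resource

def memlock_is_unlimited():
    """Return True if RLIMIT_MEMLOCK is unlimited for this user.

    mlockall(MCL_CURRENT | MCL_FUTURE) fails for a non-root process
    when RLIMIT_MEMLOCK is finite, which is the usual reason the
    startup warning appears even with JNA on the classpath.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY
```

If this returns False for the user running Cassandra, raise the memlock limit (e.g. in /etc/security/limits.conf) before concluding that mlockall succeeded.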
Re: Storing counters in the standard column families along with non-counter columns ?
On 07/13/2011 03:57 PM, Aaron Morton wrote: You can always use a dedicated CF for the counters, and use the same row key. Of course one could do this. The problem is you are now spending ~2x disk space on row keys, and app specific client code just became more complicated.
Survey: Cassandra/JVM Resident Set Size increase
### Preamble There have been several reports on the mailing list of the JVM running Cassandra using too much memory. That is, the resident set size exceeds (max java heap size + mmapped segments) and continues to grow until the process swaps, the kernel oom killer comes along, or performance just degrades too far due to the lack of space for the page cache. It has been unclear from these reports if there is a pattern. My hope here is that by comparing JVM versions, OS versions, JVM configuration etc., we will find something. Thank you everyone for your time. Some example reports: - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html - https://issues.apache.org/jira/browse/CASSANDRA-2868 - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html For reference theories include (in no particular order): - memory fragmentation - JVM bug - OS/glibc bug - direct memory - swap induced fragmentation - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity. ### Survey 1. Do you think you are experiencing this problem? 2. Why? (This is a good time to share a graph like http://www.twitpic.com/5fdabn or http://img24.imageshack.us/img24/1754/cassandrarss.png) 3. Are you using mmap? (If yes be sure to have read http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have used pmap [or another tool] to rule out mmap and top deceiving you.) 4. Are you using JNA? Was mlockall successful (it's in the logs on startup)? 5. Is swap enabled? Are you swapping? 6. What version of Apache Cassandra are you using? 7. What is the earliest version of Apache Cassandra you recall seeing this problem with? 8. Have you tried the patch from CASSANDRA-2654?
9. What jvm and version are you using? 10. What OS and version are you using? 11. What are your jvm flags? 12. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)? 13. Can you characterise how much GC your cluster is doing? 14. Approximately how many reads/writes per unit time is your cluster doing (per node or the whole cluster)? 15. How are your column families configured (key cache size, row cache size, etc.)?
Re: Storing counters in the standard column families along with non-counter columns ?
On 07/10/2011 01:09 PM, Aditya Narayan wrote: Is there any target version in near future for which this has been promised ? The ticket is problematic in that it would -- unless someone has a clever new idea -- require breaking thrift compatibility to add it to the API. Which is unfortunate, since it would be so useful. If it's in the 0.8.x series it will only be through CQL.
Re: Cassandra DC Upcoming Meetup
On 06/15/2011 08:57 AM, Chris Burroughs wrote: Cassandra DC's first meetup of the pizza and talks variety will be on July 6th. There will be an introductory sort of presentation and a totally cool one on Pig integration. If you are in the DC area it would be great to see you there. http://www.meetup.com/Cassandra-DC-Meetup/events/22145481/ My totally anecdotal impression from going to several Big Data/Hadoop/JUG meetups in the DC area is that there is a reasonable amount of interest, but not a large amount of production use. In other words, this is a great time to bring along your Cassandra Curious friends and co-workers! Hope to see some of you tomorrow. Chris Burroughs
Re: 99.999% uptime - Operations Best Practices?
On 06/22/2011 10:03 PM, Edward Capriolo wrote: I have not read the original thread concerning the problem you mentioned. One way to avoid OOM is large amounts of RAM :) On a more serious note most OOM's are caused by setting caches or memtables too large. If the OOM was caused by a software bug, the cassandra devs are on the ball and move fast. I still suggest not jumping into a release right away. For what it's worth that particular thread was about the kernel oom killer, which is a good example of the kind of gotcha that has caused several people to chime in with the importance of monitoring both Cassandra and the OS.
Re: 99.999% uptime - Operations Best Practices?
On 06/22/2011 07:12 PM, Les Hazlewood wrote: Telling me to read the mailing lists and follow the issue tracker and use monitoring software is all great and fine - and I do all of these things today already - but this is a philosophical recommendation that does not actually address my question. So I chalk this up as an error on my side in not being clear in my question - my apologies. Let me reformulate it :) For what it's worth that was intended as a concrete suggestion. We adopted Cassandra a year ago when (IMHO) it would have been a mistake to do so without the willingness to develop sufficient in-house expertise to internally patch/fork/debug if needed. Things are more mature now, best practices more widespread etc., but you should judge that yourself. In the spirit of your re-formulated questions: - Read-before-write is a Cassandra anti-pattern, avoid it if at all possible. - Those optional lines in the env script about GC logging? Uncomment them on at least some of your boxes. - use MLOCKALL+mmap, or standard io, but not mmap without MLOCKALL.
Re: 99.999% uptime - Operations Best Practices?
On 06/23/2011 01:56 PM, Les Hazlewood wrote: Is there a roadmap or time to 1.0? Even a ballpark time (e.g next year 3rd quarter, end of year, etc) would be great as it would help me understand where it may lie in relation to my production rollout. The C* devs are rather strongly inclined against putting too much meaning in version numbers. The next major release might be called 1.0. Or maybe it won't. Either way it won't be different code or support from something called 0.9 or 10.0. September 8th is the feature freeze for the next major release.
Re: BloomFilterFalsePositives equals 1.0
To be precise, you made n requests for non-existent keys, got n negative responses, and BloomFilterFalsePositives also went up by n? On 06/21/2011 11:06 PM, Preston Chang wrote: Hi,all: I have a problem with bloom filter. When made a test which tried to get some nonexistent keys, it seemed that the bloom filter does not work. The 'BloomFilterFalseRatio' was 1.0 and the 'BloomFilterFalsePositives' was rising and the disk I/O utils reached 100% according to 'iostat'. I found the patch in https://issues.apache.org/jira/browse/CASSANDRA-2637 , but in my cluster key cache had been enabled already. My Cassandra version is 0.7.3. There are 3 nodes and RF is 3. Thanks for your help.
Re: OOM (or, what settings to use on AWS large?)
On 06/22/2011 08:53 AM, Sasha Dolgy wrote: Yes ... this is because it was the OS that killed the process, and wasn't related to Cassandra crashing. Reviewing our monitoring, we saw that memory utilization was pegged at 100% for days and days before it was finally killed because 'apt' was fighting for resource. At least, that's as far as I got in my investigation before giving up, moving to 0.8.0 and implementing 24hr nodetool repair on each node via cronjob. So far ... no problems. In `free` terms, by pegged do you mean that free Mem was 0, or -/+ buffers/cache was 0?
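For anyone fuzzy on the distinction being asked about here: Linux counts page cache as "used", so raw free Mem near 0 is normal and healthy, while -/+ buffers/cache free near 0 means real pressure. A sketch (hypothetical helper, parsing /proc/meminfo-style text) of how `free` derives the two numbers:

```python
def free_vs_reclaimable(meminfo_text):
    """Parse /proc/meminfo-style text (values in kB) and return
    (raw_free, free_plus_buffers_cache): the numbers `free` shows
    as "free" and as "-/+ buffers/cache" free, respectively."""
    vals = {}
    for line in meminfo_text.splitlines():
        if ':' not in line:
            continue
        key, rest = line.split(':', 1)
        vals[key.strip()] = int(rest.split()[0])
    raw_free = vals['MemFree']
    # page cache and buffers are reclaimable, so applications can
    # still get this memory back on demand
    return raw_free, raw_free + vals['Buffers'] + vals['Cached']
```

If the second number is also near zero for days, the kernel oom killer picking off the largest process (the JVM) is the expected outcome.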
Re: 99.999% uptime - Operations Best Practices?
On 06/22/2011 05:33 PM, Les Hazlewood wrote: Just to be clear: I understand that resources like [1] and [2] exist, and I've read them. I'm just wondering if there are any 'gotchas' that might be missing from that documentation that should be considered and if there are any recommendations in addition to these documents. Thanks, Les [1] http://www.datastax.com/docs/0.8/operations/index [2] http://wiki.apache.org/cassandra/Operations Well if they knew some secret gotcha the dutiful cassandra operators of the world would update the wiki. The closest thing to a 'gotcha' is that neither Cassandra nor any other technology is going to get you those nines. Humans will need to commit to reading the mailing lists, following JIRA, and understanding what the code is doing. And humans will need to commit to combining that understanding with monitoring and alerting to figure out all of the 'it depends' for your particular case.
Re: OOM (or, what settings to use on AWS large?)
Do all of the reductions in Used on that graph correspond to node restarts? My Zabbix for reference: http://img194.imageshack.us/img194/383/2weekmem.png On 06/22/2011 06:35 PM, Sasha Dolgy wrote: http://www.twitpic.com/5fdabn http://www.twitpic.com/5fdbdg i do love a good graph. two of the weekly memory utilization graphs for 2 of the 4 servers from this ring... week 21 was a nice week ... the week before 0.8.0 went out proper. since then, bumped up to 0.8 and have seen a steady increase in the memory consumption (used) but have not seen the swap do what it did ...and the buffered/cached seems much better -sd On Thu, Jun 23, 2011 at 12:09 AM, Chris Burroughs chris.burrou...@gmail.com wrote: In `free` terms, by pegged do you mean that free Mem was 0, or -/+ buffers/cache as 0?
Cassandra DC Upcoming Meetup
Cassandra DC's first meetup of the pizza and talks variety will be on July 6th. There will be an introductory sort of presentation and a totally cool one on Pig integration. If you are in the DC area it would be great to see you there. http://www.meetup.com/Cassandra-DC-Meetup/events/22145481/
Re: Data directories
On 06/08/2011 05:54 AM, Héctor Izquierdo Seliva wrote: Is there a way to control what sstables go to what data directory? I have a fast but space limited ssd, and a way slower raid, and i'd like to put latency sensitive data into the ssd and leave the other data in the raid. Is this possible? If not, how well does cassandra play with symlinks? Another option would be to use the ssd as a block level cache with something like flashcache https://github.com/facebook/flashcache/.
Re: Index interval tuning
On 05/10/2011 10:24 PM, aaron morton wrote: What version and what were the values for RecentBloomFilterFalsePositives and BloomFilterFalsePositives ? The bloom filter metrics are updated in SSTableReader.getPosition(); the only slightly odd thing I can see is that we do not count a key cache hit as a true positive for the bloom filter. If there were a lot of key cache hits and a few false positives the ratio would be wrong. I'll ask around, does not seem to apply to Hector's case though. 0.7.1 No key cache. BloomFilterFalsePositives: 48130 Read Count: 153973494 RecentBloomFilterFalsePositives: 4, 1, 2, 0, 0, 1
Re: Index interval tuning
On 05/10/2011 02:12 PM, Peter Schuller wrote: That reminds me, my false positive ratio is stuck at 1.0, so I guess bloom filters aren't doing a lot for me. That sounds unlikely unless you're hitting some edge case like reading a particular row that happened to be a collision, and only that row. This is from JMX stats on the column family store? (From jmx) I also see BloomFilterFalseRatio stuck at 1.0 on my production nodes. The only values that RecentBloomFilterFalseRatio had over the past several minutes were 0.0 and 1.0. While I can't prove that isn't accurate, it is very suspicious. The code looked reasonable until I got to SSTableReader, which was too complicated to just glance through.
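A toy model (my reading of the accounting issue, not the actual SSTableReader code) of how the ratio can get pinned at 0.0 or 1.0: if reads that would be true positives are short-circuited elsewhere (e.g. by a key cache) and never recorded, any window containing even a single false positive reports 1.0:

```python
class BloomFilterTracker:
    """Toy accounting: ratio = FP / (FP + TP)."""

    def __init__(self):
        self.fp = 0
        self.tp = 0

    def record(self, bloom_said_yes, key_actually_present, key_cache_hit):
        if key_cache_hit:
            return  # short-circuits before the bloom filter stats update
        if bloom_said_yes:
            if key_actually_present:
                self.tp += 1
            else:
                self.fp += 1

    def false_ratio(self):
        total = self.fp + self.tp
        return self.fp / total if total else 0.0
```

With this accounting a window is either all-cache-hits (ratio 0.0) or contains a stray false positive (ratio 1.0), matching the bimodal values observed, though it would not explain the no-key-cache case above.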
Re: Native heap leaks?
On 2011-05-05 06:30, Hannes Schmidt wrote: This was my first thought, too. We switched to mmap_index_only and didn't see any change in behavior. Looking at the smaps file attached to my original post, one can see that the mmapped index files take up only a minuscule part of RSS. I have not looked into smaps before. But it actually seems odd that the mmapped index files are taking up so *little* memory. Are they only a few kb on disk? Is this a snapshot taken shortly after the process started, or shortly before the OOM killer presumably comes along? How long does it take to go from 1.1 G to 2.1 G resident? Either way, it would be worthwhile to set one node to standard io to make sure it's really not mmap causing the problem. Anyway, assuming it's not mmap, here are the other similar threads on the topic. Unfortunately none of them claim an obvious solution: http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html http://www.mail-archive.com/user@cassandra.apache.org/msg08063.html http://www.mail-archive.com/user@cassandra.apache.org/msg12036.html http://mail.openjdk.java.net/pipermail/hotspot-dev/2011-April/004091.html
Cassandra Meetup in DC
http://www.meetup.com/Cassandra-DC-Meetup/ *What*: First Cassandra DC Meetup *When*: Thursday, May 12, 2011 at 6:30 PM *Where*: Northside Social Coffee Wine - 3211 Wilson Blvd Arlington, VA I'm pleased to announce the first Cassandra DC Meetup http://www.meetup.com/Cassandra-DC-Meetup/events/17207138/. Come have a drink, meet your fellow members, talk about Apache Cassandra, discuss Greek mythological prophets, and what you want out of the group.
flashcache experimentation
https://github.com/facebook/flashcache/ FlashCache is a general purpose writeback block cache for Linux. We have a case where: - Access to data is not uniformly random (let's say Zipfian). - The hot set is larger than RAM. - Size of disk is such that buying enough SSDs, fast drives, multiple drives, etc would be undesirable. This seems like a good case for flashcache. However, as far as I can tell from searching no one has tried this and posted any results. I was wondering if anyone has tried flashcache in a similar situation with Cassandra and if so how the experience went.
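A rough way to sanity-check whether a flashcache-sized hot set would pay off is to simulate the skewed access pattern first; a sketch with made-up numbers and a crude power-law (Zipf-like) rank sampler, all parameters illustrative:

```python
import random

def cache_hit_rate(n_items=100_000, cache_frac=0.1, n_accesses=50_000,
                   skew=4.0, seed=42):
    """Estimate the hit rate of a cache holding the cache_frac most
    popular items under a skewed access pattern."""
    rng = random.Random(seed)
    cutoff = int(n_items * cache_frac)
    hits = 0
    for _ in range(n_accesses):
        # inverse-power transform: small ranks (popular items) dominate
        rank = int(n_items * rng.random() ** skew)
        if rank < cutoff:
            hits += 1
    return hits / n_accesses
```

Rerun with an item count, cache size, and skew measured from your own access logs before buying hardware; if the hit rate comes back low, the SSD cache layer won't help much.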
Re: CL.ONE reads / RR / badness_threshold interaction
On 04/12/2011 06:27 PM, Peter Schuller wrote: So to increase pinny-ness I'll further reduce RR chance and set a badness threshold. Thanks all. Just be aware that, assuming I am not missing something, while this will indeed give you better cache locality under normal circumstances - once that closest node does go down, traffic will then go to a node which will have potentially zero cache hit rate on that data since all reads up to that point were taken by the node that just went down. So it's not an obvious win depending. Yeah, there is less-than-great behaviour when nodes are restarted or otherwise go down with this configuration. Probably still preferable for my current situation. Others' mileage may vary. http://img27.imageshack.us/img27/85/cacherestart.png
Re: quick repair tool question
On 04/12/2011 11:11 AM, Jonathan Colby wrote: I'm not sure if this is the kosher way to rebuild the sstable data, but it seemed to work. http://wiki.apache.org/cassandra/Operations#Handling_failure Option #3.
Analysing hotspot gc logs
To avoid taking my own thread [1] off on a tangent: does anyone have a recommendation for a tool for graphical analysis (i.e. making useful graphs) of hotspot gc logs? Google searches have turned up several results along the lines of go try this zip file [2]. [1] http://www.mail-archive.com/user@cassandra.apache.org/msg12134.html [2] http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2009-August/000420.html
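In the meantime, pulling pause times out of a hotspot log (assuming -XX:+PrintGCDetails -XX:+PrintGCTimeStamps style output) into something gnuplot or a spreadsheet can eat is only a few lines; the regex below targets the common ", N.NNNNNNN secs]" tail and is an assumption about the exact log format:

```python
import re

# Matches e.g. "12.345: [GC 12.345: [ParNew: ..., 0.0012345 secs] ..."
PAUSE = re.compile(r'(\d+\.\d+): \[.*?, (\d+\.\d+) secs\]')

def gc_pauses(log_text):
    """Yield (timestamp_secs, pause_secs) pairs, one per log line
    that carries a pause time."""
    for line in log_text.splitlines():
        m = PAUSE.search(line)
        if m:
            yield float(m.group(1)), float(m.group(2))
```

Lines with nested collections report the inner (first) pause here; adjust the regex if you want the total at the end of the line instead.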
Re: Minor Follow-up: reduced cached mem; resident set size growth
On 04/05/2011 03:04 PM, Chris Burroughs wrote: I have gc logs if anyone is interested. This is from a node with standard io, jna enabled, but limits were not set for mlockall to succeed. One can see -/+ buffers/cache free shrinking and the C* pid's RSS growing. Includes several days of: gc log free -s /proc/$PID/status http://www.filefactory.com/file/ca94892/n/04-08.tar.gz Please enjoy! (If there is a preferred way to share the tarball let me know.)
Re: CL.ONE reads / RR / badness_threshold interaction
Peter, thank you for the extremely detailed reply. To now answer my own question, the critical points that are different from what I said earlier are: that CL.ONE does prefer *one* node (which one depending on snitch) and that RR uses digests (which are not mentioned on the wiki page [1]) instead of comparing raw requests. Totally tangential, but in the case of CL.ONE with narrow rows making the request and taking the fastest would probably be better, but having things work both ways depending on row size sounds painfully complicated. (As Aaron points out this is not how things work now.) I am assuming that RR digests save on bandwidth, but to generate the digest with a row cache miss the same number of disk seeks are required (my nemesis is disk io). So to increase pinny-ness I'll further reduce RR chance and set a badness threshold. Thanks all. [1] http://wiki.apache.org/cassandra/ReadRepair
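A sketch of the digest idea (illustrative only, not Cassandra's actual serialization): each replica hashes the row contents and the coordinator only pulls full data when digests disagree, which saves bandwidth but not the disk seeks needed to assemble the row in the first place:

```python
import hashlib

def row_digest(columns):
    """columns: iterable of (name, value, timestamp) tuples.

    Replicas can return this hash instead of the full row; the
    coordinator requests full data only when a digest mismatch
    reveals an inconsistency to repair."""
    h = hashlib.md5()
    # sort so the digest is independent of iteration order
    for name, value, ts in sorted(columns):
        h.update(repr((name, value, ts)).encode('utf-8'))
    return h.hexdigest()
```

Note the timestamp is part of the hash, so two replicas holding the same value written at different times still show up as inconsistent.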
Re: Minor Follow-up: reduced cached mem; resident set size growth
On 04/05/2011 04:38 PM, Peter Schuller wrote: - Different collectors: -XX:+UseParallelGC -XX:+UseParallelOldGC Unless you also removed the -XX:+UseConcMarkSweepGC I *think* it takes precedence, so that the above options would have no effect. I didn't test. In either case, did you definitely confirm CMS was no longer being used? (Should be pretty obvious if you ran with -XX:+PrintGCDetails which looks plenty different w/o CMS) More precisely, I did this: # GC tuning options #JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC" #JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC" #JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled" #JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8" #JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1" #JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75" #JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly" JVM_OPTS="$JVM_OPTS -XX:+UseParallelGC" JVM_OPTS="$JVM_OPTS -XX:+UseParallelOldGC" I have gc logs if anyone is interested. Yes :) By "have gc logs" I meant had them until I accidentally blew them away while restarting a server. Will post them in a day or two when there is a reasonable amount of data or the quantum state collapses and the problem vanishes when it is observed. [1] http://img194.imageshack.us/img194/383/2weekmem.png I did go back and revisit the old thread... maybe I'm missing something, but just to be real sure: What does the no color/white mean on this graph? Is that application memory (resident set)? I'm not really sure what I'm looking for since you already said you tested with 'standard' which rules out the resident-set-memory-as-a-result-of-mmap being counted towards the leak. But still. I will be the first to admit that Zabbix's graphs are not the... easiest to read. My interpretation is that no color is none of the above and by being unavailable is thus in use by applications. This fits with what I see with free and measurements of the RSS of the jvm from /proc/. I'll leave free -s going for a few days while waiting on the gc logs as an extra sanity test.
That's probably easier to reason about anyway.
CL.ONE reads / RR / badness_threshold interaction
My understanding of CL.ONE, for the node that receives the request: (A) If RR is enabled and this node contains the needed row -- return immediately and do RR to remaining replicas in background. (B) If RR is off and this node contains the needed row -- return the needed data immediately. (C) If this node does not have the needed row -- regardless of RR ask all replicas and return the first result. However case (C) as I have described it does not allow for any notion of 'pinning' as mentioned for dynamic_snitch_badness_threshold: # if set greater than zero and read_repair_chance is 1.0, this will allow # 'pinning' of replicas to hosts in order to increase cache capacity. # The badness threshold will control how much worse the pinned host has to be # before the dynamic snitch will prefer other replicas over it. This is # expressed as a double which represents a percentage. Thus, a value of # 0.2 means Cassandra would continue to prefer the static snitch values # until the pinned host was 20% worse than the fastest. The wiki states CL.ONE "Will return the record returned by the first replica to respond" [1], implying that the request goes to multiple replicas, but datastax's docs state that only one node will receive the request ("Returns the response from *the* closest replica, as determined by the snitch configured for the cluster" [2]). Could someone clarify how CL.ONE reads with RR off work? [1] http://wiki.apache.org/cassandra/API [2] http://www.datastax.com/docs/0.7/consistency/index#choosing-consistency-levels emphasis added
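My mental model of the pinning behaviour that comment describes, as a sketch (not the actual DynamicEndpointSnitch code): keep routing to the preferred replica until its latency score is more than badness_threshold worse than the best:

```python
def choose_replica(scores, pinned, badness_threshold=0.2):
    """scores: dict mapping replica -> latency score (lower is better).

    Keep the pinned replica unless it is more than badness_threshold
    (a fraction, e.g. 0.2 = 20%) worse than the current best replica."""
    best = min(scores, key=scores.get)
    if scores[pinned] <= scores[best] * (1.0 + badness_threshold):
        return pinned
    return best
```

E.g. with scores {'a': 1.0, 'b': 0.9} and pinned='a', 'a' is only ~11% worse and keeps the traffic (and its warm cache); once 'a' degrades past 20% worse, traffic moves to 'b'.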
Re: IndexInterval Tuning
On 04/05/2011 09:57 AM, Jonathan Ellis wrote: On Tue, Apr 5, 2011 at 8:54 AM, Jonathan Ellis jbel...@gmail.com wrote: Adjusting indexinterval is unlikely to be useful on very narrow rows. (Its purpose is to make random access to _large_ rows doable.) Whoops, that's column_index_size_in_kb. I'd play w/ keycache before index_interval personally. (If you can get 100% key cache hit rate it doesn't really matter what index interval is, as long as you can still build the cache effectively.) I've already tried a key cache equal to and larger (up to what I have heap space for) than my current row cache. But for very narrow rows the row cache is empirically and theoretically better. I realise changing IndexInterval is an unusual proposed configuration, but such is the burden of high cardinality narrow rows.
Minor Follow-up: reduced cached mem; resident set size growth
This is a minor followup to this thread which includes required context: http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html I haven't solved the problem, but since negative results can also be useful I thought I would share them. Things I tried unsuccessfully (on individual nodes except for the upgrade): - Upgrade from Cassandra 0.6 to 0.7 - Different collectors: -XX:+UseParallelGC -XX:+UseParallelOldGC - JNA (but not mlockall) - Switch disk_access_mode from standard to mmap_index_only (obviously in this case RSS is less than useful, but the overall memory graph still looked bad, like this [1]). On #cassandra there was speculation that a large (200k) row cache may be inducing heap fragmentation. I have not ruled this out but have been unable to reproduce it in stand-alone ConcurrentLinkedHashMap stress testing. Since turning off the row cache would be a cure worse than the disease I have not tried that yet with a real cluster. Future possibilities would be to get the limits set right for mlockall, trying combinations of the above, and running without caches. I have gc logs if anyone is interested. [1] http://img194.imageshack.us/img194/383/2weekmem.png
Re: How to determine if repair need to be run
On 03/29/2011 01:18 PM, Peter Schuller wrote: (What *would* be useful perhaps is to be able to ask a node for the time of its most recently started repair, to facilitate easier comparison with GCGraceSeconds for monitoring purposes.) I concur. JIRA time? (Perhaps keeping track of the same thing for major compactions would also be useful?)
Re: On 0.6.6 to 0.7.3 migration, DC-aware traffic and minimising data transfer
On 03/11/2011 03:46 PM, Jonathan Ellis wrote: Repairs is not yet WAN-optimized but is still cheap if your replicas are close to consistent since only merkle trees + inconsistent ranges are sent over the network. What is the ticket number for WAN optimized repair?
Re: cassandra in-production experiences with .7 series
On 03/05/2011 05:27 PM, Paul Pak wrote: Hello all, I was wondering if people could share their overall experiences with using .7 series of Cassandra in production? Is anyone using it? For what it's worth we are using a dozen-node 0.7.x cluster and have not had any major problems (our use cases dodged most of the less pleasant bugs). This replaced a smaller 0.6.x cluster that we were not happy with. Whether the new code really helped (the main feature we wanted was mx4j, due to idiosyncratic features of our monitoring system) or not we didn't have time to experimentally determine.
Re: Reducing memory footprint
On 03/04/2011 03:51 PM, Casey Deccio wrote: Are you saying: that you want a smaller heap and what settings to change to accommodate that, or that you have already set a small heap of x and Cassandra is using significantly more than that? Based on my observation above, the latter. Casey As Aaron said then the first things to look at are your jvm settings, jvm version, and io configuration (standard v mmap). You may also wish to read this thread: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/reduced-cached-mem-resident-set-size-growth-td5967110.html
Re: Reducing memory footprint
On 03/04/2011 01:53 PM, Casey Deccio wrote: I have a small ring of cassandra nodes that have somewhat limited memory capacity for the moment. Cassandra is eating up all the memory on these nodes. I'm not sure where to look first in terms of reducing the foot print. Keys cached? Compaction? Any hints would be greatly appreciated. Regards, Casey What do you mean by eating up the memory? Resident set size, low memory available to page cache, excessive gc of the jvm's heap? Are you saying: that you want a smaller heap and what settings to change to accommodate that, or that you have already set a small heap of x and Cassandra is using significantly more than that?
Re: OOM exceptions
- Does this occur only during compaction or at seemingly random times?
- How large is your heap? What JVM settings are you using? How much physical RAM do you have?
- Do you have the row and/or key cache enabled? How are they configured? How large are they when the OOM is thrown?

On 03/04/2011 02:38 PM, Mark Miller wrote: Other than adding more memory to the machine, is there a way to solve this? Please help. Thanks

ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception in thread Thread[COMPACTION-POOL:1,5,main]
java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2798)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
    at org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)
    at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)
    at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)
    at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)
    at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)
    at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
    at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
    at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
    at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
    at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)
    at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)
    at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:636)
Re: OOM exceptions
See also: http://www.datastax.com/docs/0.7/troubleshooting/index#nodes-are-dying-with-oom-errors

On 03/04/2011 03:05 PM, Chris Burroughs wrote:
- Does this occur only during compaction or at seemingly random times?
- How large is your heap? What JVM settings are you using? How much physical RAM do you have?
- Do you have the row and/or key cache enabled? How are they configured? How large are they when the OOM is thrown?

On 03/04/2011 02:38 PM, Mark Miller wrote: Other than adding more memory to the machine, is there a way to solve this? Please help. Thanks [OutOfMemoryError: Java heap space stack trace quoted in full earlier in the thread]
Re: OOM exceptions
- Are you using a key cache? How many keys do you have? Across how many column families?

Your configuration is unusual both in not setting min heap == max heap and in the percentage of available RAM used for the heap. Did you change the heap size in response to errors, or for another reason?

On 03/04/2011 03:25 PM, Mark wrote: This happens during compaction and we are not using the RowsCached attribute. Our initial/max heap are 2 and 6 GB respectively and we have 8 GB in these machines. Thanks

On 3/4/11 12:05 PM, Chris Burroughs wrote:
- Does this occur only during compaction or at seemingly random times?
- How large is your heap? What JVM settings are you using? How much physical RAM do you have?
- Do you have the row and/or key cache enabled? How are they configured? How large are they when the OOM is thrown?

On 03/04/2011 02:38 PM, Mark Miller wrote: Other than adding more memory to the machine, is there a way to solve this? Please help. Thanks [OutOfMemoryError: Java heap space stack trace quoted in full earlier in the thread]
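For reference, pinning the heap as suggested above (min == max) is a one-line change; a minimal sketch as a cassandra-env.sh fragment (the 4G value is illustrative, not a recommendation for this workload):

```shell
# Pin the JVM heap: initial (-Xms) == maximum (-Xmx), so the heap never
# has to grow mid-compaction and GC behavior is more predictable.
JVM_OPTS="$JVM_OPTS -Xms4G -Xmx4G"
```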
Re: Column name size
On 02/11/2011 05:06 AM, Patrik Modesto wrote: Hi all! I'm wondering whether the size of a column name matters for a large dataset in Cassandra (I mean lots of rows). For example, what if I have a row with 10 columns, each with a 10-byte value and a 10-byte name. Is half the row size then just the column names and the other half the data (not counting storage overhead)? What if I have 10M of these rows? Is there a difference? Should I use 3-byte codes for column names to save memory/bandwidth? Thanks, Patrik

You are correct that for small column values the names themselves can represent a large proportion of the total size. I think you will find the consensus on this list is that trying to be clever with names is usually not worth the additional complexity. The right solution to this is https://issues.apache.org/jira/browse/CASSANDRA-47.
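Patrik's back-of-the-envelope arithmetic can be checked directly; a minimal sketch (ignoring Cassandra's fixed per-column overhead such as timestamps and length prefixes, which would shrink the names' share of the actual on-disk size):

```python
# Rough arithmetic: what fraction of a row's payload is column names?
def name_fraction(n_cols, name_bytes, value_bytes):
    names = n_cols * name_bytes
    values = n_cols * value_bytes
    return names / float(names + values)

# 10 columns, 10-byte names, 10-byte values: names are half the payload,
# exactly as described in the question.
print(name_fraction(10, 10, 10))   # 0.5

# With 3-byte codes for names the share drops to about 23%,
# at the cost of readability.
print(name_fraction(10, 3, 10))
```

Note that the fraction is independent of the 10M row count: it scales linearly with rows, which is why the per-column compression in CASSANDRA-47 is the better fix than hand-shortened names.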
Re: Out of control memory consumption
On 02/09/2011 11:15 AM, Huy Le wrote: There is already an email thread on memory issues on this list, but I'm creating a new thread as we are experiencing a different memory consumption issue. We are a 12-server cluster. We use the random partitioner with manually generated server tokens. Memory usage on one server keeps growing out of control. We ran flush, cleared the key and row caches, and ran GC, but heap memory usage won't go down. The only way to get heap memory usage to go down is to restart Cassandra. We have to do this once a day. All other servers have heap memory usage less than 500MB. This issue happened on both Cassandra 0.6.6 and 0.6.11.

If heap usage continues to grow, an OOM will eventually be thrown. Are you experiencing OOMs on these boxes? If you are not OOMing, then what problem are you experiencing (excessive CPU use in garbage collection, for one example)?

Our JVM info: java version 1.6.0_21 Java(TM) SE Runtime Environment (build 1.6.0_21-b06) Java HotSpot(TM) 64-Bit Server VM (build 17.0-b16, mixed mode) And JVM memory allocation: -Xms3G -Xmx3G Non-heap memory usage is 138MB. Any recommendation on where I should look to see why memory usage keeps growing? Thanks! Huy

Are you using standard, mmap_index_only, or mmap io? Are you using JNA?
Re: Default Listen Port
On 02/09/2011 04:00 PM, jeremy.truel...@barclayscapital.com wrote: What's the easiest way to change the port nodes listen on for communication from other nodes? It appears that the default is 8080, which collides with my tomcat server on one of our dev boxes. I tried doing something in cassandra.yaml like listen_address: 192.1.fake.2: but that doesn't work; it throws an exception. Also, can you not put the actual names of servers in the config, or does it always have to be the actual IP address currently? Thanks.

8080 is used by JMX [1], not inter-node communication. You can change it in cassandra-env.sh. Hostnames are allowed.

[1] http://wiki.apache.org/cassandra/FAQ#ports
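Concretely, the change is a one-liner; a sketch assuming the stock 0.7-era cassandra-env.sh, where the port is set via a JMX_PORT variable (the replacement port 7199 is illustrative):

```shell
# cassandra-env.sh: move JMX off 8080 so it no longer collides with Tomcat.
JMX_PORT="7199"
```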
Re: OOM during batch_mutate
On 02/07/2011 06:05 PM, Jonathan Ellis wrote: Sounds like the keyspace was created on the 32GB machine, so it guessed memtable sizes that are too large when run on the 16GB one. Use update column family from the cli to cut the throughput and operations thresholds in half, or to 1/4 to be cautious.

This guessing is new in 0.7.x, right? On 0.6.x, a storage-conf.xml + sstables can be moved among machines with different amounts of RAM without needing to change anything through the cli?
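Jonathan's suggestion translates to something like the following from cassandra-cli (the column family name and the halved values are illustrative, not measured defaults; memtable_throughput is in MB and memtable_operations is in millions of operations):

```
[default@MyKeyspace] update column family MyCF with
    memtable_throughput = 64 and
    memtable_operations = 0.3;
```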
Re: CF Read and Write Latency Histograms
On 02/04/2011 12:43 PM, Jonathan Ellis wrote: Can you create a ticket? I noticed the same thing. CASSANDRA-2123 created.
Re: 0.7.0 mx4j, get attribute
On 02/02/2011 01:41 PM, Ryan King wrote: On Wed, Feb 2, 2011 at 10:40 AM, Chris Burroughs chris.burrou...@gmail.com wrote: I'm using 0.7.0 and experimenting with the new mx4j support. http://host:port/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage returns a nice pretty html page. For purposes of monitoring I would like to get a single attribute as xml. The docs [1] describe a getattribute endpoint, but I have been unable to get anything other than a blank response from it. mx4j does not seem to include any logging for troubleshooting. Example: http://host:port/getattribute?objectname=org.apache.cassandra.request%3atype%3dReadStage&attribute=PendingTasks returns 200 OK with no data. If anyone could point out what embarrassingly simple mistake I am making I would be much obliged. [1] http://mx4j.sourceforge.net/docs/ch05.html

Note that many objects in cassandra aren't initialized until they're used for the first time. -ryan

But if I can access them through jconsole just fine, I don't see what would be stopping mx4j.
Re: 0.7.0 mx4j, get attribute
On 02/03/2011 11:29 AM, Ran Tavory wrote: Try adding this to the end of the URL: ?template=identity That works, thanks!
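For the record, a small sketch (hypothetical host/port) of building that getattribute URL programmatically, with the objectname URL-encoded and the template=identity parameter appended; the assumption here is that template=identity tells mx4j to skip its XSL post-processing and return the raw XML, which is what turned the blank response into usable output:

```python
# Build an mx4j getattribute URL by hand; host/port are hypothetical.
from urllib.parse import urlencode

def getattribute_url(host, port, objectname, attribute):
    # urlencode handles escaping the ':' and '=' in the JMX ObjectName.
    # template=identity is assumed to bypass mx4j's XSL stylesheet step,
    # returning the attribute as raw XML.
    qs = urlencode({
        "objectname": objectname,
        "attribute": attribute,
        "template": "identity",
    })
    return "http://%s:%s/getattribute?%s" % (host, port, qs)

url = getattribute_url("host", 8081,
                       "org.apache.cassandra.request:type=ReadStage",
                       "PendingTasks")
print(url)
```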
Re: reduced cached mem; resident set size growth
On 01/28/2011 09:19 PM, Chris Burroughs wrote: Thanks Oleg and Zhu. I swear that wasn't a new hotspot version when I checked, but that's obviously not the case. I'll update one node to the latest as soon as I can and report back.

RSS over 48 hours with java 6 update 23: http://img716.imageshack.us/img716/5202/u2348hours.png I'll continue monitoring, but RSS still appears to grow without bound. Zhu reported a similar problem with Ubuntu 10.04. While possible, it would seem extraordinarily unlikely that there is a glibc or kernel bug affecting us both.