RE: Data Modelling Help
Secondary indices are inefficient and are deprecated, as far as I know. Unless you store many thousands of emails for a long time (which I recommend against), just use a single table with the partition key being the userID and the timestamp being the clustering (column) key, as in your schema. You might want to use a TTL to expire old emails. If you need to store a huge number of emails, consider splitting into tables by year, for example.

If you had two tables (one for read emails and one for unread emails) you'd have to move rows between them when an email got marked (un)read. But it would support efficiently finding (un)read emails.

Don

From: Sandeep Gupta [mailto:sandy@gmail.com]
Sent: Monday, April 27, 2015 11:46 AM
To: user@cassandra.apache.org
Subject: Fwd: Data Modelling Help

Hi,

I am a newbie with Cassandra and thus need data modelling help, as I haven't found a resource that tackles the same problem. The use case is similar to an email system. I want to store a timeline of all emails a user has received and then fetch them back in three different ways:

1. All emails ever received
2. Emails that have been read by the user
3. Emails that are still unread by the user

My current model is as under:

CREATE TABLE TIMELINE (
    userID varchar,
    emailID varchar,
    timestamp bigint,
    read boolean,
    PRIMARY KEY (userID, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);

CREATE INDEX ON TIMELINE (userID, read);

The queries I need to support are:

SELECT * FROM TIMELINE WHERE userID = 12;
SELECT * FROM TIMELINE WHERE userID = 12 ORDER BY timestamp ASC;
SELECT * FROM TIMELINE WHERE userID = 12 AND read = true;
SELECT * FROM TIMELINE WHERE userID = 12 AND read = false;
SELECT * FROM TIMELINE WHERE userID = 12 AND read = true ORDER BY timestamp ASC;
SELECT * FROM TIMELINE WHERE userID = 12 AND read = false ORDER BY timestamp ASC;

My questions are:

1. Should I keep "read" as my secondary index? It will be frequently updated and can create tombstones; per http://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_when_use_index_c.html, that's a problem.
2. Can we do an inequality check on a secondary index? I found that at least one equality condition should be present on a secondary index.
3. If this is not the right way to model this, please suggest how to support the above queries. Maintaining three different tables worries me because the number of insertions (for read/unread) will be huge: number of users * emails viewed per day.

Thanks in advance.

Best Regards!
Keep Walking,
~ Sandeep
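The single-table approach Don suggests can be sketched without a secondary index at all: keep one partition per user, clustered by timestamp descending, and filter read/unread on the client. This is a hedged illustration in plain Python (no Cassandra driver); the names and data are hypothetical, and it only models the access pattern, not Cassandra itself:

```python
# Models a TIMELINE-like table: partition key = userID,
# clustering key = timestamp (DESC). read/unread is filtered
# client-side rather than via a frequently-updated indexed column.
from collections import defaultdict

timeline = defaultdict(list)  # userID -> [(timestamp, emailID, read)]

def insert(user_id, email_id, ts, read=False):
    timeline[user_id].append((ts, email_id, read))
    # emulate CLUSTERING ORDER BY (timestamp DESC)
    timeline[user_id].sort(key=lambda row: row[0], reverse=True)

def all_emails(user_id):
    return timeline[user_id]

def emails_by_read(user_id, read):
    # client-side filter; avoids tombstone churn on an indexed boolean
    return [row for row in timeline[user_id] if row[2] == read]

insert("12", "a", 100)
insert("12", "b", 200, read=True)
assert [r[1] for r in all_emails("12")] == ["b", "a"]
assert [r[1] for r in emails_by_read("12", False)] == ["a"]
```

The tradeoff, as the thread notes, is that filtering happens after reading the partition; that is fine while partitions stay modest (and a TTL helps keep them modest).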
Cassandra hanging in IntervalTree.comparePoints() and in CompactionController.maxPurgeableTimestamp()
We deployed a brand new 13-node 2.1.4 C* cluster and used sstableloader to stream about 500GB into Cassandra. The streaming took less than a day, but afterwards pending compactions do not decrease. The Cassandra nodes (which have about 500 pending compactions each) seem to spend most of their time in IntervalTree.comparePoints() and in CompactionController.maxPurgeableTimestamp() (sometimes, too, in com.google.common.util.concurrent.Uninterruptibles.sleepUninterruptibly()). That's what Java VisualVM shows they're doing in the sampler. Many of the nodes show 100% CPU usage per core.

Any idea what's causing it to hang? Might http://qnalist.com/questions/5818079/reasons-for-nodes-not-compacting or https://issues.apache.org/jira/browse/CASSANDRA-8914 explain it? Altering the table to use 'cold_reads_to_omit': 0.0 didn't help.

Thanks, Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866 | C: (206) 819-5965 | F: (646) 443-2333
dona...@audiencescience.com
Tables showing up as our_table-147a2090ed4211e480153bc81e542ebd/ in data dir
Using 2.1.4, tables in our data/ directory are showing up as our_table-147a2090ed4211e480153bc81e542ebd/ instead of as our_table/. Why would that happen? We're also seeing lagging compactions and high CPU usage. Thanks, Don
Questions about bootstrapping and compactions during bootstrapping
Looking at the output of nodetool netstats, I see that the bootstrapping node is pulling from only two of the nine nodes currently in the datacenter. That surprises me: I'd think the vnodes it pulls from would be randomly spread across the existing nodes. We're using Cassandra 2.0.11 with 256 vnodes each.

I also notice that while bootstrapping, the node is quite busy doing compactions. There are over 1000 pending compactions on the new node and it's not finished bootstrapping. I'd think those would be unnecessary, since the other nodes in the data center have zero pending compactions. Perhaps the compactions explain why running du -hs /var/lib/cassandra/data on the new node shows more disk space usage than on the old nodes. Is it reasonable to do nodetool disableautocompaction on the bootstrapping node? Should that be the default?

If I start bootstrapping one node, it's not yet in the cluster but it decides which token ranges it owns and requests streams for that data. If I then try to bootstrap a SECOND node concurrently, it will take over ownership of some token ranges from the first node. Will the first node then adjust what data it streams? It seems to me the Cassandra server needs to keep track of both the OLD token ranges and vnodes and the NEW ones. I'm not convinced that running two bootstraps concurrently (starting the second one after several minutes of delay) is safe.

Thanks, Don
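The surprise about streaming from only two source nodes can be checked with a toy simulation. This is a hedged sketch (randomized token placement only, ignoring replication and Cassandra's actual range-to-stream-session mapping): with 256 random vnode tokens per node, a new node's token ranges should land on many distinct previous owners, so pulling from just two of nine nodes is indeed unexpected under this simple model.

```python
# Toy model: 9 existing nodes, 256 random vnode tokens each, RF=1.
# A new node picks 256 random tokens; count how many distinct existing
# nodes previously owned the ranges those tokens fall into.
import random

random.seed(7)
RING = 2**64
tokens = sorted((random.randrange(RING), node)
                for node in range(9) for _ in range(256))

def owner(t):
    # a token belongs to the first vnode token >= t (wrapping around)
    for tok, node in tokens:
        if tok >= t:
            return node
    return tokens[0][1]

new_tokens = [random.randrange(RING) for _ in range(256)]
sources = {owner(t) for t in new_tokens}
assert len(sources) > 2  # far more than two distinct source nodes
```

Under this model, essentially every existing node ends up as a streaming source, which is why seeing only two sources in netstats looks anomalous.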
RE: stream_throughput_outbound_megabits_per_sec
Sorry, I copy-and-pasted the wrong variable name. I meant streaming_socket_timeout_in_ms. So my question should be: streaming_socket_timeout_in_ms is the timeout per operation on the streaming socket. The docs recommend not to set it too low (because a timeout causes streaming to restart from the beginning). But the default 0 never times out. What's a reasonable value?

# Enable socket timeout for streaming operation.
# When a timeout occurs during streaming, streaming is retried from the start
# of the current file. This _can_ involve re-streaming an important amount of
# data, so you should avoid setting the value too low.
# Default value is 0, which never timeout streams.
# streaming_socket_timeout_in_ms: 0

My second question is: Does it stream an entire SSTable in one operation? I doubt it. How large is the object it streams in one operation? I'm tempted to put the timeout at 30 seconds or 1 minute. Is that too low? The entire file (SSTable) is large – several hundred megabytes. Is the timeout for streaming the entire file? Or only a block of it?

Don

From: Marcus Eriksson [mailto:krum...@gmail.com]
Sent: Friday, October 17, 2014 4:05 AM
To: user@cassandra.apache.org
Subject: Re: stream_throughput_outbound_megabits_per_sec

On Thu, Oct 16, 2014 at 1:54 AM, Donald Smith donald.sm...@audiencescience.com wrote:

> stream_throughput_outbound_megabits_per_sec is the timeout per operation on the streaming socket. The docs recommend not to have it too low (because a timeout causes streaming to restart from the beginning). But the default 0 never times out. What's a reasonable value?

No, it is not a timeout; it states how fast sstables are streamed.

> Does it stream an entire SSTable in one operation? I doubt it. How large is the object it streams in one operation? I'm tempted to put the timeout at 30 seconds or 1 minute. Is that too low?
Unsure what you mean by 'operation' here, but it is one TCP connection, streaming the whole file (if that's what we want).

/Marcus
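Since Marcus says a whole file goes over one connection, a back-of-envelope calculation helps when picking a timeout. This is a hedged sketch; the file size and throughput cap are hypothetical example numbers, not values from the thread:

```python
# Time to stream one SSTable at a given throughput cap.
# file_mb is in megabytes; the cap is in megabits per second,
# matching the units of stream_throughput_outbound_megabits_per_sec.
def stream_seconds(file_mb, cap_megabits_per_sec):
    return (file_mb * 8) / cap_megabits_per_sec

t = stream_seconds(300, 200)  # hypothetical 300 MB file at a 200 Mbit/s cap
assert t == 12.0
```

Even at full cap a several-hundred-megabyte file takes tens of seconds, and contention or throttling stretches that further, so a 30-second per-operation timeout could plausibly fire on healthy transfers.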
Is cassandra smart enough to serve Read requests entirely from Memtables in some cases?
Question about the read path in Cassandra. If a partition/row is in the Memtable and is being actively written to by other clients, will a READ of that partition also have to hit SSTables on disk (or in the page cache)? Or can it be serviced entirely from the Memtable? If you select all columns (e.g., "select * from ...") then I can imagine that Cassandra would need to merge whatever columns are in the Memtable with what's in SSTables on disk. But if you select a single column (e.g., "select Name from ... where id = ...") and if that column is in the Memtable, I'd hope Cassandra could skip checking the disk. Can it do this optimization? Thanks, Don
RE: Is cassandra smart enough to serve Read requests entirely from Memtables in some cases?
On the Cassandra IRC channel I discussed this question. I learned that the timestamp in the Memtable may be OLDER than the timestamp in some SSTable (e.g., due to hints or retries). So there's no guarantee that the Memtable has the most recent version. But there may be cases, they say, in which the timestamp in the SSTable can be used to skip over SSTables that have older data (via metadata on SSTables, I presume). Memtables are like write-through caches and do NOT correspond to SSTables loaded from disk.

From: jonathan.had...@gmail.com [mailto:jonathan.had...@gmail.com] On Behalf Of Jonathan Haddad
Sent: Wednesday, October 22, 2014 9:24 AM
To: user@cassandra.apache.org
Subject: Re: Is cassandra smart enough to serve Read requests entirely from Memtables in some cases?

No. Consider a scenario where you supply a timestamp a week in the future, flush it to sstable, and then do a write with the current timestamp. The record on disk will have a timestamp greater than the one in the memtable.

On Wed, Oct 22, 2014 at 9:18 AM, Donald Smith donald.sm...@audiencescience.com wrote:

> Question about the read path in Cassandra. If a partition/row is in the Memtable and is being actively written to by other clients, will a READ of that partition also have to hit SSTables on disk (or in the page cache)? Or can it be serviced entirely from the Memtable? If you select all columns (e.g., "select * from ...") then I can imagine that Cassandra would need to merge whatever columns are in the Memtable with what's in SSTables on disk. But if you select a single column (e.g., "select Name from ... where id = ...") and if that column is in the Memtable, I'd hope Cassandra could skip checking the disk. Can it do this optimization? Thanks, Don

--
Jon Haddad
http://www.rustyrazorblade.com
twitter: rustyrazorblade
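Jonathan's future-timestamp scenario boils down to last-write-wins reconciliation on cell timestamps. A minimal sketch (plain Python, illustrative names only) of why the Memtable cell cannot be trusted to win the merge:

```python
# Last-write-wins: the cell with the highest timestamp wins,
# regardless of whether it lives in the memtable or an sstable.
def merged_value(cells):
    # cells: iterable of (write_timestamp, value)
    return max(cells, key=lambda c: c[0])[1]

memtable_cell = (1000, "current write")      # written just now
sstable_cell = (2000, "future-stamped")      # flushed earlier, but with a
                                             # client-supplied future timestamp
assert merged_value([memtable_cell, sstable_cell]) == "future-stamped"
```

Because the sstable cell's timestamp is larger, a read that consulted only the Memtable would return the wrong value, which is why the answer to the subject line is "no" in general.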
Re: Question about adding nodes incrementally to a new datacenter: wait til all hosts come up so they can learn the token ranges?
Even with vnodes, when you add a node to a cluster, it takes over some portions of the token range. If the other nodes have been running for a long time you should bootstrap the new node, so it gets old data. Then you should run nodetool cleanup on the other nodes to eliminate no-longer-needed rows which now belong to the new node. So, my point is that to avoid the need to bootstrap and to clean up, it's better to bring all nodes up at about the same time. If this is wrong, please explain why.

Thanks, Don

From: Robert Coli rc...@eventbrite.com
Sent: Wednesday, October 15, 2014 1:54 PM
To: user@cassandra.apache.org
Subject: Re: Question about adding nodes incrementally to a new datacenter: wait til all hosts come up so they can learn the token ranges?

On Tue, Oct 14, 2014 at 4:52 PM, Donald Smith donald.sm...@audiencescience.com wrote:

> Suppose I create a new DC with 25 nodes. I have their IPs in cassandra-topology.properties. Twenty-three of the nodes start up, but two of the nodes fail to start. If I start replicating (via nodetool rebuild) without those two nodes, then when those 2 nodes enter the DC the distribution of tokens to vnodes will change and I'd need to rebuild or bootstrap, right? In other words, it's better to wait till all nodes come up before we start replicating. Does this sound right? I presume that all the nodes need to come up so they can learn the token ranges.

I don't understand your question. Vnodes exist to randomly distribute data on each physical node into [n] virtual node chunks, 256 by default. They do this in order to allow you to add 2 nodes to your 25 node cluster without rebalancing the prior 23.

The simplest way to illustrate this is to imagine a token range of 0-20 in a 4 node cluster with RF=1:

A 0-5
B 5-10
C 10-15
D 15-20 (0)

Each node has 25% of the data. If you add a new node E, and want it to join with 25% of the data, there is literally nowhere you can have it join to accomplish this goal. You have to join it in between one of the existing nodes, and then move each of those nodes so that the distribution is even again. This is why, prior to vnodes, the best practice was to double your cluster size.

=Rob
http://twitter.com/rcolidba
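Rob's illustration can be checked exhaustively for his 0-20 ring: a single-token node E splits exactly one existing range, so no join point yields five equal 20%-shares (4 tokens each) without also moving A-D. A small sketch of that check:

```python
# Rob's example ring: 0..20, nodes A,B,C,D owning 5 tokens each.
# Joining single-token node E splits exactly ONE range, so an even
# 5-way split (4 tokens per node) is impossible at any join point.
boundaries = [5, 10, 15, 20]  # upper boundaries of A, B, C, D

def shares_after_join(e_token):
    pts = sorted(boundaries + [e_token])
    prev, shares = 0, []
    for p in pts:
        shares.append(p - prev)
        prev = p
    return shares

for e in range(1, 20):
    if e in boundaries:
        continue
    # no placement of E produces five equal shares of 4
    assert sorted(shares_after_join(e)) != [4, 4, 4, 4, 4]
```

With vnodes, each physical node instead owns many small random slices, so a new node takes a thin sliver from everyone rather than halving one neighbor, which is the point of Rob's reply.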
stream_throughput_outbound_megabits_per_sec
stream_throughput_outbound_megabits_per_sec is the timeout per operation on the streaming socket. The docs recommend not to have it too low (because a timeout causes streaming to restart from the beginning). But the default 0 never times out. What's a reasonable value? Does it stream an entire SSTable in one operation? I doubt it. How large is the object it streams in one operation? I'm tempted to put the timeout at 30 seconds or 1 minute. Is that too low? Some of our rebuilds hang for many hours and we figure we need a timeout. Thanks, Don
Question about adding nodes incrementally to a new datacenter: wait til all hosts come up so they can learn the token ranges?
Suppose I create a new DC with 25 nodes. I have their IPs in cassandra-topology.properties. Twenty-three of the nodes start up, but two of the nodes fail to start. If I start replicating (via nodetool rebuild) without those two nodes, then when those 2 nodes enter the DC the distribution of tokens to vnodes will change and I'd need to rebuild or bootstrap, right? In other words, it's better to wait till all nodes come up before we start replicating. Does this sound right? I presume that all the nodes need to come up so they can learn the token ranges. Thanks, Don
timeout for port 7000 on stateful firewall? streaming_socket_timeout_in_ms?
We have a stateful firewall (http://en.wikipedia.org/wiki/Stateful_firewall) between data centers for port 7000 (inter-cluster). How long should the idle timeout be for the connections on the firewall? Similarly, what's appropriate for streaming_socket_timeout_in_ms in cassandra.yaml? The default is 0 (no timeout). I presume that streaming_socket_timeout_in_ms refers to streams such as for bootstrapping and rebuilding. Thanks
RE: Would warnings about overlapping SStables explain high pending compactions?
Version 2.0.9. We have 11 ongoing compactions on that node.

From: Marcus Eriksson [mailto:krum...@gmail.com]
Sent: Thursday, September 25, 2014 12:45 AM
To: user@cassandra.apache.org
Subject: Re: Would warnings about overlapping SStables explain high pending compactions?

Not really. What version are you on? Do you have pending compactions and no ongoing compactions?

/Marcus

On Wed, Sep 24, 2014 at 11:35 PM, Donald Smith donald.sm...@audiencescience.com wrote:

> On one of our nodes we have lots of pending compactions (499). In the past we've seen pending compactions go up to 2400 and all the way back down again. Investigating, I saw warnings such as the following in the logs about overlapping SSTables and about needing to run "nodetool scrub" on a table. Would the overlapping SSTables explain the pending compactions?
>
> WARN [RMI TCP Connection(2)-10.5.50.30] 2014-09-24 09:14:11,207 LeveledManifest.java (line 154) At level 1, SSTableReader(path='/data/data/XYZ/ABC/XYZ-ABC-jb-388233-Data.db') [DecoratedKey(-6112875836465333229, 3366636664393031646263356234663832383264616561666430383739383738), DecoratedKey(-4509284829153070912, 3366336562386339376664376633353635333432636662373739626465393636)] overlaps SSTableReader(path='/data/data/XYZ/ABC/XYZ-ABC_blob-jb-388150-Data.db') [DecoratedKey(-4834684725563291584, 336633623334363664363632666365303664333936336337343566373838), DecoratedKey(-4136919579566299218, 3366613535646662343235336335633862666530316164323232643765323934)]. This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory. Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable
>
> Thanks
Experience with multihoming cassandra?
We have large boxes with 256G of RAM and SSDs. From iostat, top, and sar we think the system has excess capacity. Anyone have recommendations about multihoming (http://en.wikipedia.org/wiki/Multihoming) Cassandra on such a node (connecting it to multiple IPs and running multiple Cassandras simultaneously)? I'm skeptical, since Cassandra already has built-in multi-threading and since if the box went down multiple nodes would disappear. We're using C* version 2.0.9. A google/bing search for "multihoming cassandra" doesn't turn much up.
Adjusting readahead for SSD disk seeks
We're using Cassandra as a key-value store; our values are small. So we're thinking we don't need much disk readahead (e.g., blockdev --getra /dev/sda). We're using SSDs. When Cassandra does disk seeks to satisfy read requests, does it typically have to read the entire SSTable into memory (assuming the bloom filter said yes)? If Cassandra needs to read in lots of blocks anyway, or if it needs to read the entire file during compaction, then I'd expect we might as well have a big readahead. Perhaps there's a tradeoff between read latency and compaction time. Any feedback welcome. Thanks
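When reasoning about readahead for small-value reads, it helps to convert the blockdev figure into bytes: `blockdev --getra` reports readahead in 512-byte sectors (worth verifying against your kernel's docs). A small sketch of the arithmetic, with example settings that are illustrative rather than recommendations:

```python
# Convert a readahead setting (in sectors, as reported by
# `blockdev --getra`) into kilobytes.
def readahead_kb(sectors, sector_bytes=512):
    return sectors * sector_bytes // 1024

assert readahead_kb(256) == 128  # a common default: 256 sectors = 128 KB
assert readahead_kb(16) == 8     # a small readahead for tiny key-value reads
```

A 128 KB readahead per seek dwarfs a small value, so shrinking it can cut wasted I/O on random reads; sequential compaction reads are less affected since the kernel's readahead ramps up on sequential access anyway.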
Would warnings about overlapping SStables explain high pending compactions?
On one of our nodes we have lots of pending compactions (499). In the past we've seen pending compactions go up to 2400 and all the way back down again. Investigating, I saw warnings such as the following in the logs about overlapping SSTables and about needing to run "nodetool scrub" on a table. Would the overlapping SSTables explain the pending compactions?

WARN [RMI TCP Connection(2)-10.5.50.30] 2014-09-24 09:14:11,207 LeveledManifest.java (line 154) At level 1, SSTableReader(path='/data/data/XYZ/ABC/XYZ-ABC-jb-388233-Data.db') [DecoratedKey(-6112875836465333229, 3366636664393031646263356234663832383264616561666430383739383738), DecoratedKey(-4509284829153070912, 3366336562386339376664376633353635333432636662373739626465393636)] overlaps SSTableReader(path='/data/data/XYZ/ABC/XYZ-ABC_blob-jb-388150-Data.db') [DecoratedKey(-4834684725563291584, 336633623334363664363632666365303664333936336337343566373838), DecoratedKey(-4136919579566299218, 3366613535646662343235336335633862666530316164323232643765323934)]. This could be caused by a bug in Cassandra 1.1.0 .. 1.1.3 or due to the fact that you have dropped sstables from another node into the data directory. Sending back to L0. If you didn't drop in sstables, and have not yet run scrub, you should do so since you may also have rows out-of-order within an sstable

Thanks
Is there harm from having all the nodes in the seed list?
Is there any harm from having all the nodes listed in the seeds list in cassandra.yaml?
Is it wise to increase native_transport_max_threads if we have lots of CQL clients?
If we have hundreds of CQL clients (for C* 2.0.9), should we increase native_transport_max_threads in cassandra.yaml from the default (128) to the number of clients? If we don't do that, I presume requests will queue up, resulting in higher latency. What's a reasonable max value for native_transport_max_threads? Thanks, Don
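One hedged way to size a thread pool like this is Little's law: the number of concurrently busy threads is roughly the request arrival rate times the mean service time, independent of how many client connections exist. The numbers below are hypothetical, purely to illustrate the arithmetic:

```python
# Little's law sketch: busy threads ~= arrival rate * mean service time.
def threads_needed(requests_per_sec, mean_service_sec):
    return requests_per_sec * mean_service_sec

# Hypothetical: 20,000 req/s at 5 ms average service time keeps
# about 100 threads busy, so the default of 128 would still have
# headroom even with hundreds of client connections.
assert threads_needed(20_000, 0.005) == 100.0
```

By this reasoning, the client count alone doesn't dictate the pool size; measured throughput and per-request latency do, so raising the setting to "number of clients" may be unnecessary.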
Trying to understand cassandra gc logs
I understand that Cassandra uses ParNew GC for the new gen and CMS for the old gen (tenured). I'm trying to interpret in the logs when a Full GC happens and what kind of Full GC is used. It never says "Full GC" or anything like that. But I see that whenever there's a line like

2014-09-15T18:04:17.197-0700: 117485.192: [CMS-concurrent-mark-start]

the count of full GCs increases from

{Heap after GC invocations=158459 (full 931):

to a line like:

{Heap before GC invocations=158459 (full 932):

See the highlighted lines in the gclog output below. So, apparently there was a full GC between those two lines. Between those lines it also has two lines, such as:

2014-09-15T18:04:17.197-0700: 117485.192: Total time for which application threads were stopped: 0.0362080 seconds
2014-09-15T18:04:17.882-0700: 117485.877: Total time for which application threads were stopped: 0.0129660 seconds

Also, the full count (932 above) is always exactly half the number (1864) FGC returned by jstat, as in:

dc1-cassandra01.dc01 /var/log/cassandra> sudo jstat -gcutil 28511
  S0     S1     E      O      P       YGC     YGCT     FGC    FGCT     GCT
 55.82   0.00  82.45  45.02  59.76  165772  5129.728  1864  320.247  5449.975

So, I am apparently correct that "(full 932)" is the count of Full GCs. I'm perplexed by the log output, though. I also see lines mentioning concurrent mark-sweep that do not appear to correspond to full GCs. So, my questions are: Is CMS used also for full GCs? If not, what kind of GC is done? The logs don't say. Lines saying "Total time for which application threads were stopped" appear twice per full GC; why?

Apparently, even our Full GCs are fast. 99% of them finish within 0.18 seconds; 99.9% finish within 0.5 seconds (which may be too slow for some of our clients). Here below is some log output, with interesting parts highlighted in grey or yellow.

Thanks, Don

{Heap before GC invocations=158458 (full 931):
 par new generation total 1290240K, used 1213281K [0x0005bae0, 0x00061260, 0x00061260)
  eden space 1146880K, 100% used [0x0005bae0, 0x000600e0, 0x000600e0)
  from space 143360K, 46% used [0x000600e0, 0x000604ed87c0, 0x000609a0)
  to space 143360K, 0% used [0x000609a0, 0x000609a0, 0x00061260)
 concurrent mark-sweep generation total 8003584K, used 5983572K [0x00061260, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 44820K, used 26890K [0x0007fae0, 0x0007fd9c5000, 0x0008)
2014-09-15T18:04:17.131-0700: 117485.127: [GC Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 197474318
Max Chunk Size: 160662270
Number of Blocks: 3095
Av. Block Size: 63804
Tree Height: 32
Before GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 2285026
Max Chunk Size: 2279936
Number of Blocks: 8
Av. Block Size: 285628
Tree Height: 5
2014-09-15T18:04:17.133-0700: 117485.128: [ParNew
Desired survivor size 73400320 bytes, new threshold 1 (max 1)
- age 1: 44548776 bytes, 44548776 total
: 1213281K->49867K(1290240K), 0.0264540 secs] 7196854K->6059170K(9293824K)
After GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 195160244
Max Chunk Size: 160662270
Number of Blocks: 3093
Av. Block Size: 63097
Tree Height: 32
After GC:
Statistics for BinaryTreeDictionary:
Total Free Space: 2285026
Max Chunk Size: 2279936
Number of Blocks: 8
Av. Block Size: 285628
Tree Height: 5
, 0.0286700 secs] [Times: user=0.37 sys=0.01, real=0.03 secs]
Heap after GC invocations=158459 (full 931):
 par new generation total 1290240K, used 49867K [0x0005bae0, 0x00061260, 0x00061260)
  eden space 1146880K, 0% used [0x0005bae0, 0x0005bae0, 0x000600e0)
  from space 143360K, 34% used [0x000609a0, 0x00060cab2e18, 0x00061260)
  to space 143360K, 0% used [0x000600e0, 0x000600e0, 0x000609a0)
 concurrent mark-sweep generation total 8003584K, used 6009302K [0x00061260, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 44820K, used 26890K [0x0007fae0, 0x0007fd9c5000, 0x0008)
}
2014-09-15T18:04:17.161-0700: 117485.156: Total time for which application threads were stopped: 0.0421350 seconds
2014-09-15T18:04:17.173-0700: 117485.168: [GC [1 CMS-initial-mark: 6009302K(8003584K)] 6059194K(9293824K), 0.0231840 secs] [Times: user=0.03 sys=0.00, real=0.03 secs]
2014-09-15T18:04:17.197-0700: 117485.192: Total time for which application threads were stopped: 0.0362080 seconds
2014-09-15T18:04:17.197-0700: 117485.192: [CMS-concurrent-mark-start]
2014-09-15T18:04:17.681-0700: 117485.677: [CMS-concurrent-mark: 0.484/0.484 secs]
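The two observations in the email (the "(full N)" counter incrementing, and the stop-the-world lines) can be pulled out of a gc log mechanically. A hedged sketch, run against a trimmed sample of the lines quoted above; the regexes assume this HotSpot log format and may need adjusting for other JVM flags:

```python
# Extract the full-GC counter and the total stop-the-world time
# from CMS/ParNew gc log lines like the ones quoted in the email.
import re

gc_log = """\
{Heap before GC invocations=158458 (full 931):
2014-09-15T18:04:17.161-0700: 117485.156: Total time for which application threads were stopped: 0.0421350 seconds
Heap after GC invocations=158459 (full 931):
2014-09-15T18:04:17.197-0700: 117485.192: Total time for which application threads were stopped: 0.0362080 seconds
{Heap before GC invocations=158459 (full 932):
"""

full_counts = [int(m) for m in re.findall(r"\(full (\d+)\)", gc_log)]
stops = [float(m) for m in
         re.findall(r"stopped: ([\d.]+) seconds", gc_log)]

assert full_counts[-1] - full_counts[0] == 1   # one full (CMS) cycle in this window
assert abs(sum(stops) - 0.078343) < 1e-6       # total stop-the-world time
```

The two "stopped" lines per full-count increment are consistent with a CMS cycle having two short stop-the-world phases (initial mark and remark) around its concurrent work, which would also explain why jstat's FGC is about twice the "(full N)" count.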
RE: How often are JMX Cassandra metrics reset?
Thanks, Chris. 75thPercentile is clearly NOT lifetime: its value jumps around. However, I can tell that Max is lifetime; it's been showing the exact same value for days, on various nodes. Hence my doubts.

From: Chris Lohfink [mailto:clohf...@blackbirdit.com]
Sent: Thursday, August 28, 2014 3:56 PM
To: user@cassandra.apache.org
Subject: Re: How often are JMX Cassandra metrics reset?

In the version of metrics used there's a uniform reservoir and an exponentially weighted one. This is used to compute the min, max, mean, std dev and quantiles. For the timers, by default it uses the exponentially decaying one, which is weighted toward the last 5 minutes. http://grepcode.com/file/repo1.maven.org/maven2/com.yammer.metrics/metrics-core/2.2.0/com/yammer/metrics/core/Timer.java?av=f

Chris Lohfink

On Aug 28, 2014, at 5:39 PM, Donald Smith donald.sm...@audiencescience.com wrote:

> The metrics OneMinuteRate, FiveMinuteRate, FifteenMinuteRate, and MeanRate are NOT lifetime values, but they're all counts of requests, not latency. The latency values (Max, Count, 50thPercentile, Mean, etc.) ARE lifetime values, I think, and thus would seem to be kinda useless for me, since our servers have been running for months. Maybe there's a way to reset lifetime metrics to zero. I connected to a cassandra server remotely via jConsole (port 7199) and I can read various metrics via the MBeans, but I don't see an operation for resetting to zero. But perhaps that's because I'm connecting remotely.
>
> ClientRequest/Read/Latency:
> LatencyUnit = MICROSECONDS
> FiveMinuteRate = 1.12
> FifteenMinuteRate = 1.11
> RateUnit = SECONDS
> MeanRate = 1.65
> OneMinuteRate = 1.13
> EventType = calls
> Max = 237,373.37
> Count = 961,312
> 50thPercentile = 383.2
> Mean = 908.46
> Min = 95.64
> StdDev = 3,034.62
> 75thPercentile = 626.34
> 95thPercentile = 954.31
> 98thPercentile = 1,443.11
> 99thPercentile = 1,472.4
> 999thPercentile = 1,858.1

From: Nick Bailey [mailto:n...@datastax.com]
Sent: Thursday, August 28, 2014 1:50 PM
To: user@cassandra.apache.org
Subject: Re: How often are JMX Cassandra metrics reset?

Those percentile values should be for the lifetime of the node, yes. Depending on what version of OpsCenter you are using, it is either using the 'recent' metrics described by Rob, or it is using the FiveMinuteRate from JMX as well as doing some of its own aggregation depending on the rollup size.

On Thu, Aug 28, 2014 at 12:36 PM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Aug 28, 2014 at 9:27 AM, Donald Smith donald.sm...@audiencescience.com wrote:

> And yet OpsCenter shows graphs with ever-changing metrics that show recent performance. Does OpsCenter not get its stats from JMX?

1) Certain JMX endpoints expose "recent" metrics, or at least used to. These are recent as in "since the last time someone polled this endpoint."
2) OpsCenter samples via JMX and then stores metrics in its own columnfamily. I would not be shocked if it does some minor aggregation as it does so.

This all said, OpsCenter is not Apache Cassandra software, so the Apache Cassandra user mailing list may not be the ideal place for it to be discussed or supported...

=Rob
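The "weighted for the last 5 minutes" behavior Chris describes can be illustrated with the exponentially weighted moving average used for rates like FiveMinuteRate. This is a hedged, simplified sketch of that scheme (constants modeled on the metrics library's 5-second tick, not taken from its source verbatim):

```python
# EWMA sketch: each 5-second tick decays the rate toward the latest
# interval's instantaneous rate, so recent activity dominates and old
# traffic fades with a ~5-minute time constant.
import math

TICK = 5.0
alpha_5min = 1 - math.exp(-TICK / 300.0)  # 300 s = 5-minute window

def tick(rate, count_in_tick):
    instant = count_in_tick / TICK
    return rate + alpha_5min * (instant - rate)

rate = 100.0                  # a steady 100 ops/s historically
for _ in range(120):          # then 10 minutes of silence
    rate = tick(rate, 0)
assert rate < 15.0            # old traffic has mostly decayed away
```

This is why the rate metrics "jump around" while Max, drawn from a reservoir that never forgets its extreme, can sit at the same value for days.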
Rebuilding a cassandra seed node with the same tokens and same IP address
One of our nodes is getting an increasing number of pending compactions due, we think, to https://issues.apache.org/jira/browse/CASSANDRA-7145, which is fixed in future version 2.0.11. (We had the same error a month ago, but at that time we were in pre-production and could just clean the disks on all the nodes and restart. Now we want to be cleverer.)

To overcome the issue we figure we should just rebuild the node using the same token range, to avoid unneeded data reshuffling. So we figure we should:

(1) find the tokens in use on that node via nodetool ring,
(2) stop cassandra on that node,
(3) delete the data directory,
(4) use the tokens saved in step (1) as the initial_token list, and
(5) restart the node.

But the node is a seed node and Cassandra won't bootstrap seed nodes. Perhaps removing that node's address from the seeds list on the other nodes (and on that node) will be sufficient. That's what "Replacing a Dead Seed Node" (http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_seed_node.html) suggests. Perhaps I can remove the IP address from the seeds list on all nodes in the cluster, restart all the nodes, and then restart the bad node with auto_bootstrap=true.

I want to use the same IP address, and so I don't think I can follow the instructions at http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html, because they assume the IP address of the dead node and the new node differ. If I just start it up it will start serving traffic and read requests will fail. It wouldn't be the end of the world (the production use isn't critical yet).

Should we use nodetool rebuild $LOCAL_DC? (Though I think that's mostly for adding a data center.) Should I add it back in and do nodetool repair? I'm afraid that would be too slow. Again, I don't want to REMOVE the node from the cluster: that would cause reshuffling of token ranges and data. I want to use the same token range. Any suggestions?

Thanks, Don
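Step (1) of the plan above can be scripted: scrape this node's tokens out of `nodetool ring` output and join them into an initial_token value. This is a hedged helper with a simplified, assumed line layout (address first, token last); the real column layout varies by Cassandra version, so check it against your actual output before using anything like this:

```python
# Collect one node's tokens from (simplified, hypothetical) `nodetool ring`
# output and format them as a comma-separated initial_token list.
def tokens_for(address, ring_output):
    toks = []
    for line in ring_output.splitlines():
        cols = line.split()
        if cols and cols[0] == address:
            toks.append(cols[-1])  # assume token is the last column
    return ",".join(toks)

sample = """\
10.0.0.1  rack1  Up  Normal  120.5 GB  33.3%  -9100000000000000000
10.0.0.2  rack1  Up  Normal  118.2 GB  33.3%  -3000000000000000000
10.0.0.1  rack1  Up  Normal  120.5 GB  33.3%  4200000000000000000
"""
assert tokens_for("10.0.0.1", sample) == "-9100000000000000000,4200000000000000000"
```

With 256 vnodes the resulting initial_token line is long but valid; saving it before wiping the data directory is what lets the rebuilt node reclaim the same ranges.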
RE: How often are JMX Cassandra metrics reset?
And yet OpsCenter shows graphs with ever-changing metrics that show recent performance. Does OpsCenter not get its stats from JMX?

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Wednesday, August 27, 2014 12:56 PM
To: user@cassandra.apache.org
Subject: Re: How often are JMX Cassandra metrics reset?

On Wed, Aug 27, 2014 at 12:38 PM, Donald Smith donald.sm...@audiencescience.com wrote:

> I'm using JMX to retrieve Cassandra metrics. I notice that Max and Count are cumulative and aren't reset. How often are the stats for Mean, 99thPercentile, etc. reset back to zero?

If they're like the old latency numbers, they are from node startup time and are never reset.

=Rob
RE: How often are JMX Cassandra metrics reset?
The metrics OneMinuteRate, FiveMinuteRate, FifteenMinuteRate, and MeanRate are NOT lifetime values, but they're all counts of requests, not latency. The latency values (Max, Count, 50thPercentile, Mean, etc.) ARE lifetime values, I think, and thus would seem to be kinda useless for me, since our servers have been running for months. Maybe there's a way to reset lifetime metrics to zero. I connected to a cassandra server remotely via jConsole (port 7199) and I can read various metrics via the MBeans, but I don't see an operation for resetting to zero. But perhaps that's because I'm connecting remotely.

ClientRequest/Read/Latency:
  LatencyUnit = MICROSECONDS
  FiveMinuteRate = 1.12
  FifteenMinuteRate = 1.11
  RateUnit = SECONDS
  MeanRate = 1.65
  OneMinuteRate = 1.13
  EventType = calls
  Max = 237,373.37
  Count = 961,312
  50thPercentile = 383.2
  Mean = 908.46
  Min = 95.64
  StdDev = 3,034.62
  75thPercentile = 626.34
  95thPercentile = 954.31
  98thPercentile = 1,443.11
  99thPercentile = 1,472.4
  999thPercentile = 1,858.1

From: Nick Bailey [mailto:n...@datastax.com] Sent: Thursday, August 28, 2014 1:50 PM To: user@cassandra.apache.org Subject: Re: How often are JMX Cassandra metrics reset?

Those percentile values should be for the lifetime of the node, yes. Depending on what version of OpsCenter you are using, it is either using the 'recent' metrics described by Rob, or it is using the FiveMinuteRate from JMX as well as doing some of its own aggregation depending on the rollup size.

On Thu, Aug 28, 2014 at 12:36 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Aug 28, 2014 at 9:27 AM, Donald Smith donald.sm...@audiencescience.com wrote: And yet OpsCenter shows graphs with ever-changing metrics that show recent performance. Does OpsCenter not get its stats from JMX?

1) Certain JMX endpoints expose recent metrics, or at least used to. These are recent as in since the last time someone polled this endpoint.
2) OpsCenter samples via JMX and then stores metrics in its own columnfamily. I would not be shocked if it does some minor aggregation as it does so. This all said, OpsCenter is not Apache Cassandra software, so the Apache Cassandra user mailing list may not be the ideal place for it to be discussed or supported... =Rob
How often are JMX Cassandra metrics reset?
I'm using JMX to retrieve Cassandra metrics. I notice that Max and Count are cumulative and aren't reset. How often are the stats for Mean, 99thPercentile, etc. reset back to zero? For example, 99thPercentile shows as 1.5 ms. Over how many minutes?

ClientRequest/Read/Latency:
  LatencyUnit = MICROSECONDS
  FiveMinuteRate = 1.12
  FifteenMinuteRate = 1.11
  RateUnit = SECONDS
  MeanRate = 1.65
  OneMinuteRate = 1.13
  EventType = calls
  Max = 237,373.37
  Count = 961,312
  50thPercentile = 383.2
  Mean = 908.46
  Min = 95.64
  StdDev = 3,034.62
  75thPercentile = 626.34
  95thPercentile = 954.31
  98thPercentile = 1,443.11
  99thPercentile = 1,472.4
  999thPercentile = 1,858.1

Donald A. Smith | Senior Software Engineer P: 425.201.3900 x 3866 C: (206) 819-5965 F: (646) 443-2333 dona...@audiencescience.com [AudienceScience]
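For anyone reproducing this, the pattern for reading these attributes programmatically looks roughly like the following. To stay self-contained, the example reads an attribute from the local JVM's platform MBeanServer; against Cassandra you would open a remote JMX connection on port 7199 instead, and the Cassandra ObjectName in the comment is my assumption of the 2.x metrics naming, not something verified in this thread.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Sketch of the JMX attribute-reading pattern behind numbers like 99thPercentile.
// For illustration this queries the local platform MBeanServer. Against Cassandra
// you would instead connect remotely via
//   service:jmx:rmi:///jndi/rmi://HOST:7199/jmxrmi
// and use an ObjectName such as (assumption, per 2.x metrics naming):
//   org.apache.cassandra.metrics:type=ClientRequest,scope=Read,name=Latency
public class JmxReadExample {
    public static long uptimeMillis() throws Exception {
        MBeanServerConnection mbs = ManagementFactory.getPlatformMBeanServer();
        ObjectName runtime = new ObjectName("java.lang:type=Runtime");
        // getAttribute works the same way for Cassandra's Max/Mean/99thPercentile
        // attributes on a remote MBeanServerConnection.
        return (Long) mbs.getAttribute(runtime, "Uptime");
    }

    public static void main(String[] args) throws Exception {
        System.out.println("JVM uptime (ms): " + uptimeMillis());
    }
}
```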
RE: adding more nodes into the cluster
According to datastax's documentation at http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html, "By default, this setting [auto_bootstrap] is true and not listed in the cassandra.yaml file." But http://wiki.apache.org/cassandra/StorageConfiguration says: "Default is: 'false', so that new clusters don't bootstrap immediately. You should turn this on when you start adding new nodes to a cluster that already has data on it." So which is correct?

Also, the two pages disagree on the instructions on how to add new nodes to an existing cluster. The first page says to set auto_bootstrap to 'false' when adding a new data center to a cluster: "Setting this parameter to false prevents the new nodes from attempting to get all the data from the other nodes in the data center. When you run nodetool rebuild (http://www.datastax.com/documentation/cassandra/2.0/cassandra/tools/toolsRebuild.html) in the last step, each node is properly mapped." The second page suggests setting auto_bootstrap to 'true' when you add new nodes to an existing cluster: "You should turn this on when you start adding new nodes to a cluster that already has data on it." Perhaps that applies only to adding new nodes to an existing data center (not a new data center to an existing cluster).

So, I'm not clear what I should do. I want to add a data center to an existing cluster. If I set auto_bootstrap to true in the new nodes of the new data center, will it stream data from the other data centers? Perhaps it will stream only NEW rows. Perhaps the purpose of doing "nodetool rebuild" is to force streaming OLD data (like a repair). It's not clear. Maybe auto_bootstrap=true is equivalent to (auto_bootstrap=false plus "nodetool rebuild"). Thoughts? Don

From: Robert Coli [mailto:rc...@eventbrite.com] Sent: Wednesday, July 16, 2014 12:31 PM To: user@cassandra.apache.org Subject: Re: adding more nodes into the cluster

On Wed, Jul 16, 2014 at 12:28 PM, Robert Coli rc...@eventbrite.com wrote: It applies whenever one is bootstrapping a node. One is bootstrapping a node whenever one starts a node with auto_bootstrap set to true (the default) and with either one-or-more tokens in initial_token or num_tokens set. Ugh, sorry: 1) starting a node 2) with auto_bootstrap:true (default) 3) initial_token or num_tokens populated 4) node has never successfully bootstrapped before, and has not therefore written the information of its successful bootstrap to the system keyspace. If the node has bootstrapped before, it will not do so again unless replace_address is used. =Rob
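As I read the first DataStax page, the new-data-center procedure amounts to this fragment; this is my interpretation of that page, with a placeholder data center name, not a verified recipe:

```yaml
# cassandra.yaml on each node of the NEW data center, per the DataStax
# add-a-data-center procedure (as I read it):
auto_bootstrap: false   # don't stream automatically at startup
```

Then, once all the new nodes are up, run `nodetool rebuild <existing-dc-name>` on each new node to stream the existing (OLD) data from the source data center.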
Problem with /etc/cassandra for cassandra 2.0.8
I installed a package version of cassandra via sudo yum install cassandra20.noarch onto a clean host and got: cassandra20.noarch 2.0.8-2 @datastax. That resulted in a problem: /etc/cassandra/ did not exist. So I did sudo yum downgrade cassandra20.noarch and got version 2.0.7. That fixed the problem: /etc/cassandra appeared. Anyone else have a problem with version 2.0.8? I don't see any release note suggesting they moved that directory.
RE: Cassandra data retention policy
CQL lets you specify a default TTL per column family/table, e.g. WITH default_time_to_live = 86400.

From: Redmumba [mailto:redmu...@gmail.com] Sent: Monday, April 28, 2014 12:51 PM To: user@cassandra.apache.org Subject: Re: Cassandra data retention policy

Have you looked into using a TTL? You can set this per insert (unfortunately, it can't be set per CF) and values will be tombstoned after that amount of time. I.e., INSERT INTO ... VALUES ... USING TTL 15552000. Keep in mind, after the values have expired, they will essentially become tombstones, so you will still need to run clean-ups (probably daily) to clear up space. Does this help? One caveat is that this is difficult to apply to existing rows; i.e., you can't bulk-update a bunch of rows with this data. As such, another good suggestion is to simply have a secondary index on a date field of some kind, and run a bulk remove (and subsequent clean-up) daily/weekly/whatever.

On Mon, Apr 28, 2014 at 11:31 AM, Han Jia johnideal...@gmail.com wrote: Hi guys, We have a processing system that just uses the data for the past six months in Cassandra. Any suggestions on the best way to manage the old data in order to save disk space? We want to keep it as backup but it will not be used unless we need to do recovery. Thanks in advance! -John
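Spelled out as CQL, the two approaches in this thread look like the following. A sketch only: the table and column names are invented for illustration.

```sql
-- Per-table default TTL: every inserted column expires after one day
-- unless the write overrides it.
CREATE TABLE events (
    id      text PRIMARY KEY,
    payload text
) WITH default_time_to_live = 86400;

-- Per-insert TTL (roughly six months), overriding any table default:
INSERT INTO events (id, payload) VALUES ('k1', 'v1') USING TTL 15552000;
```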
Lots of commitlog files
1. With cassandra 2.0.6, we have 547G of files in /var/lib/commitlog/. I started a nodetool flush 65 minutes ago; it's still running. The 17536 commitlog files have been created in the last 3 days. (The node has 2.1T of sstables data in /var/lib/cassandra/data/. This is in staging, not prod.) Why so many commit logs? Here are our commitlog-related settings in cassandra.yaml:

commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
# The size of the individual commitlog file segments. A commitlog
# archiving commitlog segments (see commitlog_archiving.properties),
commitlog_segment_size_in_mb: 32
# Total space to use for commitlogs. Since commitlog segments are
# segment and remove it. So a small total commitlog space will tend
# commitlog_total_space_in_mb: 4096

Maybe we should set commitlog_total_space_in_mb to something other than the default. According to OpsCenter, commitlog_total_space_in_mb is None. But it seems odd that there'd be so many commit logs. The node is under heavy write load. There are about 2900 compactions pending. We are NOT archiving commitlogs via commitlog_archiving.properties.

BTW, the documentation for nodetool (http://wiki.apache.org/cassandra/NodeTool) says: "Flush: Flushes memtables (in memory) to SSTables (on disk), which also enables CommitLog (http://wiki.apache.org/cassandra/CommitLog) segments to be deleted." But even after doing a flush, the /var/lib/commitlog dir still has 1G of files, even after waiting 30 minutes. Each file is 32M in size, plus or minus a few bytes. I tried this on other clusters, with much smaller amounts of data. Even restarting Cassandra doesn't help. I surmise that the 1GB of commit logs is normal: they probably allocate that space as a workspace. Thanks, Don
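A quick sanity check on those numbers (my arithmetic only, not a diagnosis): 17,536 segments at 32 MB each works out to 548 GB, which matches the observed 547G, while a commitlog_total_space_in_mb of 4096 would correspond to only 128 segments:

```java
// Sanity-check arithmetic for the commitlog numbers reported above:
// does segment-size x segment-count match the directory size, and how
// many segments would a 4096 MB cap allow?
public class CommitlogMath {
    static final long MB = 1024L * 1024L;

    // Total bytes occupied by `segments` files of `segmentSizeMb` each.
    public static long totalBytes(long segments, long segmentSizeMb) {
        return segments * segmentSizeMb * MB;
    }

    // Number of segments a total-space cap permits.
    public static long segmentsAllowed(long totalSpaceMb, long segmentSizeMb) {
        return totalSpaceMb / segmentSizeMb;
    }

    public static void main(String[] args) {
        // 17,536 files x 32 MB = 548 GB, consistent with the observed 547G.
        System.out.println(totalBytes(17536, 32) / (1024.0 * MB) + " GB");
        // A 4096 MB total-space cap would allow only 128 segments of 32 MB.
        System.out.println(segmentsAllowed(4096, 32) + " segments");
    }
}
```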
RE: Lots of commitlog files
Another thing. cassandra.yaml says:

# Total space to use for commitlogs. Since commitlog segments are
# mmapped, and hence use up address space, the default size is 32
# on 32-bit JVMs, and 1024 on 64-bit JVMs.
#
# If space gets above this value (it will round up to the next nearest
# segment multiple), Cassandra will flush every dirty CF in the oldest
# segment and remove it. So a small total commitlog space will tend
# to cause more flush activity on less-active columnfamilies.
#
# commitlog_total_space_in_mb: 4096

We're using a 64-bit linux with a 64-bit JVM: Java(TM) SE Runtime Environment (build 1.7.0_40-b43), Java HotSpot(TM) 64-Bit Server VM (build 24.0-b56, mixed mode). But our commit log files are each 32MB in size. Is this indicative of a bug? Shouldn't they be 1024MB in size? Don

From: Donald Smith Sent: Monday, April 14, 2014 12:04 PM To: 'user@cassandra.apache.org' Subject: Lots of commitlog files

1. With cassandra 2.0.6, we have 547G of files in /var/lib/commitlog/. I started a nodetool flush 65 minutes ago; it's still running. The 17536 commitlog files have been created in the last 3 days. (The node has 2.1T of sstables data in /var/lib/cassandra/data/. This is in staging, not prod.) Why so many commit logs? Here are our commitlog-related settings in cassandra.yaml: commitlog_sync: periodic commitlog_sync_period_in_ms: 1 # The size of the individual commitlog file segments. A commitlog # archiving commitlog segments (see commitlog_archiving.properties), commitlog_segment_size_in_mb: 32 # Total space to use for commitlogs. Since commitlog segments are # segment and remove it. So a small total commitlog space will tend # commitlog_total_space_in_mb: 4096 Maybe we should set commitlog_total_space_in_mb to something other than the default. According to OpsCenter, commitlog_total_space_in_mb is None. But it seems odd that there'd be so many commit logs. The node is under heavy write load. There are about 2900 compactions pending.
We are NOT archiving commitlogs via commitlog_archiving.properties. BTW, the documentation for nodetool (http://wiki.apache.org/cassandra/NodeTool) says: "Flush: Flushes memtables (in memory) to SSTables (on disk), which also enables CommitLog (http://wiki.apache.org/cassandra/CommitLog) segments to be deleted." But even after doing a flush, the /var/lib/commitlog dir still has 1G of files, even after waiting 30 minutes. Each file is 32M in size, plus or minus a few bytes. I tried this on other clusters, with much smaller amounts of data. Even restarting Cassandra doesn't help. I surmise that the 1GB of commit logs is normal: they probably allocate that space as a workspace. Thanks, Don
Setting gc_grace_seconds to zero and skipping nodetool repair (was RE: Timeseries with TTL)
This statement is significant: "BTW if you never delete and only ttl your values at a constant value, you can set gc=0 and forget about periodic repair of the table, saving some space, IO, CPU, and an operational step."

Setting gc_grace_seconds to zero has the effect of not storing hinted handoffs (which prevent deleted data from reappearing), I believe. "Periodic repair" refers to running "nodetool repair" (aka Anti-Entropy). I too have wondered if setting gc_grace_seconds to zero and skipping "nodetool repair" are safe.

We're using C* 2.0.6. In the 2.0.X versions, with vnodes, "nodetool repair ..." is very slow (see https://issues.apache.org/jira/browse/CASSANDRA-5220 and https://issues.apache.org/jira/browse/CASSANDRA-6611). We found repairs via "nodetool repair" unacceptably slow, even when we restricted them to one table, and often the repairs hung or failed. We also tried subrange repairs and the other options.

Our app does no deletes and only rarely updates a row (if there was bad data that needs to be replaced). So it's very tempting to set gc_grace_seconds = 0 in the table definitions and skip repairs. But there is Cassandra documentation that warns that repairs are necessary even if you don't do deletes. For example, http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html says: "Note: If deletions never occur, you should still schedule regular repairs. Be aware that setting a column to null is a delete." The apache wiki https://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair says: "Unless your application performs no deletes, it is strongly recommended that production clusters run nodetool repair periodically on all nodes in the cluster."
"*IF* your operations team is sufficiently on the ball, you can get by without repair as long as you do not have hardware failure -- in that case, HintedHandoff (https://wiki.apache.org/cassandra/HintedHandoff) is adequate to repair successful updates that some replicas have missed. Hinted handoff is active for max_hint_window_in_ms after a replica fails. Full repair or re-bootstrap is necessary to re-replicate data lost to hardware failure (see below)."

So, if there are hardware failures, "nodetool repair" is needed. And http://planetcassandra.org/general-faq/ says: "Anti-Entropy Node Repair - For data that is not read frequently, or to update data on a node that has been down for an extended period, the node repair process (also referred to as anti-entropy repair) ensures that all data on a replica is made consistent. Node repair (using the nodetool utility) should be run routinely as part of regular cluster maintenance operations."

If RF=2, ReadConsistency is ONE, and data failed to get replicated to the second node, then during a read might the app incorrectly return "missing data"? It seems to me that the need to run "nodetool repair" reflects a design bug; it should be automated. Don

From: Laing, Michael [mailto:michael.la...@nytimes.com] Sent: Sunday, April 06, 2014 11:31 AM To: user@cassandra.apache.org Subject: Re: Timeseries with TTL

Since you are using LeveledCompactionStrategy there is no major/minor compaction - just compaction. Leveled compaction does more work - your logs don't look unreasonable to me - the real question is whether your nodes can keep up with the IO. SSDs work best. BTW if you never delete and only ttl your values at a constant value, you can set gc=0 and forget about periodic repair of the table, saving some space, IO, CPU, and an operational step. If your nodes cannot keep up with the IO, switch to SizeTieredCompaction and monitor read response times. Or add SSDs.
In my experience, for smallish nodes running C* 2 without SSDs, LeveledCompactionStrategy can cause the disk cache to churn, reducing read performance substantially. So watch out for that. Good luck, Michael

On Sun, Apr 6, 2014 at 10:25 AM, Vicent Llongo villo...@gmail.com wrote: Hi, Most of the queries to that table are just getting a range of values for a metric:

SELECT val FROM metrics_5min WHERE uid = ? AND metric = ? AND ts >= ? AND ts <= ?

I'm not sure from the logs what kind of compactions they are. This is what I see in system.log (grepping for that specific table):

... INFO [CompactionExecutor:742] 2014-04-06 13:30:11,223 CompactionTask.java (line 105) Compacting [SSTableReader(path='/mnt/disk1/cassandra/data/keyspace/metrics_5min/keyspace-metrics_5min-ic-14991-Data.db'), SSTableReader(path='/mnt/disk1/cassandra/data/keyspace/metrics_5min/keyspace-metrics_5min-ic-14990-Data.db')]
INFO [CompactionExecutor:753] 2014-04-06 13:35:22,495 CompactionTask.java (line 105) Compacting [SSTableReader(path='/mnt/disk1/cassandra/data/keyspace/metrics_5min/keyspace-metrics_5min-ic-14992-Data.db'), SSTableReader(path='/mnt/disk1/cassandra/data/keyspace/metrics_5min/keyspace-metrics_5min-ic-14993-Data.db')]
INFO
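Michael's suggestion, applied to the table queried above, would look something like the DDL below. The column types and clustering layout are guesses inferred from the SELECT statement (only uid, metric, ts, and val appear in the thread), and the TTL value is a placeholder:

```sql
-- Constant TTL on every write and no deletes, so gc_grace_seconds can be 0
-- and periodic repair of this table can be skipped (per Michael's advice).
-- Schema inferred from the SELECT in the thread; types are assumptions.
CREATE TABLE metrics_5min (
    uid    text,
    metric text,
    ts     timestamp,
    val    double,
    PRIMARY KEY ((uid, metric), ts)
) WITH compaction = { 'class' : 'LeveledCompactionStrategy' }
  AND gc_grace_seconds = 0;

-- Every insert carries the same TTL (30 days here, as a placeholder):
INSERT INTO metrics_5min (uid, metric, ts, val)
VALUES ('u1', 'cpu', '2014-04-06 13:30:00', 0.42) USING TTL 2592000;
```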
Question about rpms from datastax
On http://rpm.riptano.com/community/noarch/ what's the difference between cassandra20-2.0.6-1.noarch.rpm (http://rpm.riptano.com/community/noarch/cassandra20-2.0.6-1.noarch.rpm) and dsc20-2.0.6-1.noarch.rpm (http://rpm.riptano.com/community/noarch/dsc20-2.0.6-1.noarch.rpm)? Thanks, Don
nodetool scrub throws exception FileAlreadyExistsException
% time nodetool scrub -s as_reports data_report_info_2011 xss = -ea -javaagent:/usr/share/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms8192M -Xmx8192M -Xmn2048M -XX:+HeapDumpOnOutOfMemoryError -Xss256k Exception in thread main FSWriteError in /mnt/cassandra-storage/data/as_reports/data_report_info_2011/snapshots/pre-scrub-1395848747073/as_reports-data_report_info_2011-jb-3-Data.db at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:84) at org.apache.cassandra.io.sstable.SSTableReader.createLinks(SSTableReader.java:1215) at org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1817) at org.apache.cassandra.db.ColumnFamilyStore.scrub(ColumnFamilyStore.java:1123) at org.apache.cassandra.service.StorageService.scrub(StorageService.java:2197) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at sun.reflect.misc.Trampoline.invoke(Unknown Source) at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at sun.reflect.misc.MethodUtil.invoke(Unknown Source) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source) at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(Unknown Source) at com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(Unknown Source) at com.sun.jmx.mbeanserver.PerInterface.invoke(Unknown Source) at com.sun.jmx.mbeanserver.MBeanSupport.invoke(Unknown Source) at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(Unknown Source) at com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(Unknown Source) at javax.management.remote.rmi.RMIConnectionImpl.doOperation(Unknown Source) at 
javax.management.remote.rmi.RMIConnectionImpl.access$300(Unknown Source) at javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(Unknown Source) at javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(Unknown Source) at javax.management.remote.rmi.RMIConnectionImpl.invoke(Unknown Source) at sun.reflect.GeneratedMethodAccessor42.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at sun.rmi.server.UnicastServerRef.dispatch(Unknown Source) at sun.rmi.transport.Transport$1.run(Unknown Source) at sun.rmi.transport.Transport$1.run(Unknown Source) at java.security.AccessController.doPrivileged(Native Method) at sun.rmi.transport.Transport.serviceCall(Unknown Source) at sun.rmi.transport.tcp.TCPTransport.handleMessages(Unknown Source) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(Unknown Source) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) at java.lang.Thread.run(Unknown Source) Caused by: java.nio.file.FileAlreadyExistsException: /mnt/cassandra-storage/data/as_reports/data_report_info_2011/snapshots/pre-scrub-1395848747073/as_reports-data_report_info_2011-jb-3-Data.db - /mnt/cassandra-storage/data/as_reports/data_report_info_2011/as_reports-data_report_info_2011-jb-3-Data.db at sun.nio.fs.UnixException.translateToIOException(Unknown Source) at sun.nio.fs.UnixException.rethrowAsIOException(Unknown Source) at sun.nio.fs.UnixFileSystemProvider.createLink(Unknown Source) at java.nio.file.Files.createLink(Unknown Source) at org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:80) ... 39 more 1.112u 0.122s 3:38.36 0.5% 0+0k 0+328io 0pf+0w That table is new and very unlikely to be corrupted. I retried the command without -s and it succeeded right away. 
I tried again WITH -s and it succeeded again too.
Question about how compaction and partition keys interact
In CQL we need to decide between using ((customer_id,type),date) as the CQL primary key for a reporting table, versus ((customer_id,date),type). We store reports for every day.

If we use (customer_id,type) as the partition key (physical key), then we have a WIDE ROW where each date's data is stored in a different column. Over time, as new reports are added for different dates, the row will get wider and wider, and I thought that might cause more work for compaction. So, would a partition key of (customer_id,date) yield better compaction behavior? Again, if we use (customer_id,type) as the partition key, then over time, as new columns are added to that row for different dates, I'd think that compaction would have to merge new data for a given physical row from multiple sstables. That would make compaction expensive. But if we use (customer_id,date) as the partition key, then new data will be added to new physical rows, and so compaction would have less work to do.

My question is really about how compaction interacts with partition keys. Someone on the Cassandra irc channel, http://webchat.freenode.net/?channels=#cassandra, said that when partition keys overlap between sstables, there's only slightly more work to do than when they don't, for merging sstables in compaction. So he thought the first form, ((customer_id,type),date), would be better.

One advantage of the first form, ((customer_id,type),date), is that we can get all report data for all dates for a given customer and type in a single wide row -- and we do have an (uncommon) use case for such reports. If we used a primary key of ((customer_id,type,date)), then the rows would be un-wide; that wouldn't take advantage of clustering columns and (like the second form) wouldn't support the (uncommon) use case mentioned in the previous paragraph. Thanks, Don
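For reference, the two candidate layouts written out as DDL. The column types and the data column are invented for illustration:

```sql
-- Option A: partition on (customer_id, type); each partition grows by one
-- row per date (the "wide row" case). Supports reading all dates for a
-- given customer and type from a single partition.
CREATE TABLE report_by_type (
    customer_id text,
    type        text,
    date        timestamp,
    data        blob,
    PRIMARY KEY ((customer_id, type), date)
);

-- Option B: partition on (customer_id, date); each new day lands in a
-- brand-new partition, so compaction rarely has to merge one partition
-- across many sstables.
CREATE TABLE report_by_date (
    customer_id text,
    date        timestamp,
    type        text,
    data        blob,
    PRIMARY KEY ((customer_id, date), type)
);
```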
RE: memory usage spikes
Prem, Did you follow the instructions at http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html?scroll=reference_ds_sxl_gf3_2k ? And did you install jna-3.2.7.jar into /usr/share/java, as per http://www.datastax.com/documentation/cassandra/2.0/mobile/cassandra/install/installJnaRHEL.html ? Don

From: prem yadav [mailto:ipremya...@gmail.com] Sent: Wednesday, March 26, 2014 10:36 AM To: user@cassandra.apache.org Subject: Re: memory usage spikes

here: ps -p `/usr/java/jdk1.6.0_37/bin/jps | awk '/Dse/ {print $1}'` uww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
497 20450 0.9 31.0 4727620 2502644 ? SLl 06:55 3:28 /usr/java/jdk1.6.0_37//bin/java -ea -javaagent:/usr/share/dse/cassandra/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1968M -Xmx1968M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError -Xss190k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true -Dcom.sun.management.jmxremote.port=7199 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -Dcassandra-pidfile=/var/run/dse.pid -cp
:/usr/share/dse/dse.jar:/usr/share/dse/common/commons-codec-1.6.jar:/usr/share/dse/common/commons-io-2.4.jar:/usr/share/dse/common/guava-13.0.jar:/usr/share/dse/common/jbcrypt-0.3m.jar:/usr/share/dse/common/log4j-1.2.16.jar:/usr/share/dse/common/slf4j-api-1.6.1.jar:/usr/share/dse/common/slf4j-log4j12-1.6.1.jar:/etc/dse:/usr/share/java/jna.jar:/etc/dse/cassandra:/usr/share/dse/cassandra/tools/lib/stress.jar:/usr/share/dse/cassandra/lib/antlr-2.7.7.jar:/usr/share/dse/cassandra/lib/antlr-3.2.jar:/usr/share/dse/cassandra/lib/antlr-runtime-3.2.jar:/usr/share/dse/cassandra/lib/avro-1.4.0-cassandra-1.jar:/usr/share/dse/cassandra/lib/cassandra-all-1.1.9.10.jar:/usr/share/dse/cassandra/lib/cassandra-clientutil-1.1.9.10.jar:/usr/share/dse/cassandra/lib/cassandra-thrift-1.1.9.10.jar:/usr/share/dse/cassandra/lib/commons-cli-1.1.jar:/usr/share/dse/cassandra/lib/commons-codec-1.6.jar:/usr/share/dse/cassandra/lib/commons-lang-2.4.jar:/usr/share/dse/cassandra/lib/commons-logging-1.1.1.jar:/usr/share/dse/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/dse/cassandra/lib/concurrentlinkedhashmap-lru-1.3.jar:/usr/share/dse/cassandra/lib/guava-13.0.jar:/usr/share/dse/cassandra/lib/high-scale-lib-1.1.2.jar:/usr/share/dse/cassandra/lib/httpclient-4.0.1.jar:/usr/share/dse/cassandra/lib/httpcore-4.0.1.jar:/usr/share/dse/cassandra/lib/jackson-core-asl-1.9.2.jar:/usr/share/dse/cassandra/lib/jackson-mapper-asl-1.9.2.jar:/usr/share/dse/cassandra/lib/jamm-0.2.5.jar:/usr/share/dse/cassandra/lib/jline-0.9.94.jar:/usr/share/dse/cassandra/lib/joda-time-1.6.2.jar:/usr/share/dse/cassandra/lib/json-simple-1.1.jar:/usr/share/dse/cassandra/lib/libthrift-0.7.0.jar:/usr/share/dse/cassandra/lib/log4j-1.2.16.jar:/usr/share/dse/cassandra/lib/metrics-core-2.0.3.jar:/usr/share/dse/cassandra/lib/servlet-api-2.5.jar:/usr/share/dse/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/dse/cassandra/lib/snakeyaml-1.6.jar:/usr/share/dse/cassandra/lib/snappy-java-1.0.5.jar:/usr/share/dse/cassandra/lib/snaptree-0.1.jar:/usr/sha
re/dse/cassandra/lib/stringtemplate-3.2.jar::/usr/share/dse/solr/lib/solr-4.0.2.4-SNAPSHOT-uber.jar:/usr/share/dse/solr/lib/solr-web-4.0.2.4-SNAPSHOT.jar:/usr/share/dse/solr/conf::/usr/share/dse/tomcat/lib/annotations-api-6.0.32.jar:/usr/share/dse/tomcat/lib/catalina-6.0.32.jar:/usr/share/dse/tomcat/lib/catalina-ha-6.0.32.jar:/usr/share/dse/tomcat/lib/coyote-6.0.32.jar:/usr/share/dse/tomcat/lib/el-api-6.0.29.jar:/usr/share/dse/tomcat/lib/jasper-6.0.29.jar:/usr/share/dse/tomcat/lib/jasper-el-6.0.29.jar:/usr/share/dse/tomcat/lib/jasper-jdt-6.0.29.jar:/usr/share/dse/tomcat/lib/jsp-api-6.0.29.jar:/usr/share/dse/tomcat/lib/juli-6.0.32.jar:/usr/share/dse/tomcat/lib/servlet-api-6.0.29.jar:/usr/share/dse/tomcat/lib/tribes-6.0.32.jar:/usr/share/dse/tomcat/conf::/usr/share/dse/hadoop:/etc/dse/hadoop:/usr/share/dse/hadoop/lib/ant-1.6.5.jar:/usr/share/dse/hadoop/lib/automaton-1.11-8.jar:/usr/share/dse/hadoop/lib/commons-beanutils-1.7.0.jar:/usr/share/dse/hadoop/lib/commons-beanutils-core-1.8.0.jar:/usr/share/dse/hadoop/lib/commons-cli-1.2.jar:/usr/share/dse/hadoop/lib/commons-codec-1.4.jar:/usr/share/dse/hadoop/lib/commons-collections-3.2.1.jar:/usr/share/dse/hadoop/lib/commons-configuration-1.6.jar:/usr/share/dse/hadoop/lib/commons-digester-1.8.jar:/usr/share/dse/hadoop/lib/commons-el-1.0.jar:/usr/share/dse/hadoop/lib/commons-httpclient-3.0.1.jar:/usr/share/dse/hadoop/lib/commons-lang-2.4.jar:/u Its the spike in RAM usage. Now it is normal but keeps showing the spikes. On Wed, Mar 26, 2014 at 5:31 PM, Marcin Cabaj
RE: Question about how compaction and partition keys interact
My underlying question is about the effects of the partitioning key on compaction. Specifically, would having date as part of the partitioning key make compaction easier (because compaction wouldn't have to merge wide rows over multiple days)? According to the person on irc, it wouldn't make much difference.

We care mostly about read times. If read times were all we cared about, we'd use a CQL primary key of ((customer_id,type), date), especially since it lets us efficiently iterate over all dates for a given customer and type. I also care about compaction time, and if the other primary key form decreased compaction time, I might go for it. We have terabytes of data. I don't think we ever have to query all types for a given customer or date. That is, we are always given a specific customer and type, plus usually but not always a date. Thanks, Don

From: Jonathan Lacefield [mailto:jlacefi...@datastax.com] Sent: Wednesday, March 26, 2014 11:20 AM To: user@cassandra.apache.org Subject: Re: Question about how compaction and partition keys interact

Don, What is the underlying question? Are you trying to figure out what's going to be faster for reads, or are you really concerned about storage? The recommendation typically provided is to suggest that tables are modeled based on query access, to enable the fastest read performance. In your example, will your app's queries look for:

1) customer interactions by type by day, with the ability to
   - sort by day within a type
   - grab ranges of dates for a type quickly
   - or pull all dates (and cell data) for a type

or 2) customer interactions by date by type, with the ability to
   - sort by type within a date
   - grab ranges of types for a date quickly
   - or pull all types' data for a date

We also typically recommend that partitions stay within ~100k of columns or ~100MB per partition.
With your first scenario, wide row, you wouldn't hit the number of columns for ~273 years :) What's interesting in your modeling scenario is that, with the current options, you don't have the ability to easily pull all dates for a customer without specifying the type, specific dates, or using ALLOW FILTERING. Did you ever consider partitioning simply on customer and using date and type as clustering keys? Hope that helps. Jonathan

Jonathan Lacefield Solutions Architect, DataStax (404) 822 3487 http://www.linkedin.com/in/jlacefield http://www.datastax.com/what-we-offer/products-services/training/virtual-training

On Wed, Mar 26, 2014 at 1:22 PM, Donald Smith donald.sm...@audiencescience.com wrote: In CQL we need to decide between using ((customer_id,type),date) as the CQL primary key for a reporting table, versus ((customer_id,date),type). We store reports for every day. If we use (customer_id,type) as the partition key (physical key), then we have a WIDE ROW where each date's data is stored in a different column. Over time, as new reports are added for different dates, the row will get wider and wider, and I thought that might cause more work for compaction. So, would a partition key of (customer_id,date) yield better compaction behavior? Again, if we use (customer_id,type) as the partition key, then over time, as new columns are added to that row for different dates, I'd think that compaction would have to merge new data for a given physical row from multiple sstables. That would make compaction expensive. But if we use (customer_id,date) as the partition key, then new data will be added to new physical rows, and so compaction would have less work to do. My question is really about how compaction interacts with partition keys.
Someone on the Cassandra IRC channel, http://webchat.freenode.net/?channels=#cassandra, said that when partition keys overlap between sstables, merging sstables during compaction takes only slightly more work than when they don't overlap. So he thought the first form, ((customer_id, type), date), would be better.

One advantage of the first form, ((customer_id, type), date), is that we can get all report data for all dates for a given customer and type in a single wide row -- and we do have an (uncommon) use case for such reports.

If we used a primary key of ((customer_id, type, date)), then the rows would not be wide; that wouldn't take advantage of clustering columns and (like the second form) wouldn't support the (uncommon) use case mentioned in the previous paragraph.

Thanks, Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com
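For reference, the three layouts discussed in this thread can be written out side by side. Column names and types below are assumed for illustration; the thread does not give the full schema:

```sql
-- Form 1: one wide partition per (customer, type); dates are clustered within it.
CREATE TABLE reports_by_type (
    customer_id text,
    type        text,
    date        timestamp,
    data        blob,          -- payload column assumed for illustration
    PRIMARY KEY ((customer_id, type), date)
);

-- Form 2: one partition per (customer, date); types are clustered within it.
CREATE TABLE reports_by_date (
    customer_id text,
    date        timestamp,
    type        text,
    data        blob,
    PRIMARY KEY ((customer_id, date), type)
);

-- Jonathan's alternative: partition on customer only, so all dates and types
-- for a customer can be read in one query without ALLOW FILTERING
-- (at the cost of an even wider partition, so watch partition size).
CREATE TABLE reports_by_customer (
    customer_id text,
    date        timestamp,
    type        text,
    data        blob,
    PRIMARY KEY (customer_id, date, type)
);
```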
Speed of sstableloader
I tested bulk loading in Cassandra with CQLSSTableWriter and sstableloader. It turns out that writing 1 million rows with sstableloader took over twice as long as inserting regularly with batched CQL statements from Java (cassandra-driver-core, version 2.0.0). Specifically, the call to sstableloader shown below took just over 12 minutes, while inserting with Java batch statements took just over 5 minutes. I checked this twice and the same thing happened both times. Is this expected?

Thanks, Don

Here's the code (slightly edited and abbreviated):

import org.apache.cassandra.exceptions.InvalidRequestException;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;
import java.io.IOException;
import java.util.Random;

// Load with: sstableloader -v -d 10.12.2.91,10.12.2.92,10.12.2.93 /tmp/test/test_table
public class CreateLoadableSSTableCQL {
    // ... (makeRandomString and other helpers elided)

    private static void create(int count) throws IOException, InvalidRequestException {
        String schema = "CREATE TABLE test.test_table (id text PRIMARY KEY, value text)";
        String insert = "INSERT INTO test.test_table (id, value) VALUES (?, ?)";
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/test/test_table")
                .forTable(schema)
                .using(insert)
                .build();
        for (int i = 0; i < count; i++) {
            writer.addRow(makeRandomString(32), makeRandomString(100));
        }
        writer.close();
    }

    public static void main(String[] args) {
        // 12.1 minutes using sstableloader on qa; 5.1 minutes using regular batched inserts
        int count = 100;
        if (args.length > 0) {
            count = Integer.parseInt(args[0]);
        }
        try {
            create(count);
        } catch (InvalidRequestException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
RE: Cassandra DSC 2.0.5 not starting - * could not access pidfile for Cassandra
You may need to do: chown -R cassandra /var/lib/cassandra /var/log/cassandra

Don

From: user 01 [mailto:user...@gmail.com]
Sent: Monday, March 10, 2014 10:23 AM
To: user@cassandra.apache.org
Subject: Re: Cassandra DSC 2.0.5 not starting - * could not access pidfile for Cassandra

$ sudo su - cassandra

I don't know why, but this isn't actually working. It does not switch me to the cassandra user. [Btw, should this actually switch me to the cassandra user?] User switching on my servers does not work for users like the tomcat7 user or the cassandra user, but it works for users that were manually created. I tested this on two of my test servers, with the same results on both.
RE: replication_factor: ?
Robert, please elaborate on why you say "To make best use of Cassandra, my minimum recommendation is usually RF=3, N=6." I surmise that with fewer than six nodes, you'd likely perform better with a sequential/single-node solution -- that you need at least six nodes to overcome the overheads from concurrency. But that's a vague explanation.

Thanks, Don

From: Robert Coli [mailto:rc...@eventbrite.com]
Sent: Friday, March 07, 2014 11:36 AM
To: user@cassandra.apache.org
Subject: Re: replication_factor: ?

On Fri, Mar 7, 2014 at 7:26 AM, Daniel Curry <daniel.cu...@arrayent.com> wrote:

I would like to know what the rule of thumb is for the replication_factor number. I think the answer depends on how many nodes one has? I.e., three nodes would mean the number 3. What would happen if I put the number 2 for a three-node cluster?

To make best use of Cassandra, my minimum recommendation is usually RF=3, N=6. There are certainly valid use cases with lower RF or N, but from what I can tell they are an order of magnitude less common.

=Rob
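One way to make the RF arithmetic concrete (my own illustration, not from the thread): a QUORUM read or write needs a majority of replicas, floor(RF/2) + 1, so RF=2 gives no fault tolerance at QUORUM while RF=3 tolerates one replica being down.

```java
// Illustration: quorum size as a function of replication factor, and how
// many replica failures a QUORUM operation can tolerate.
public class QuorumMath {
    // QUORUM requires a majority of replicas: floor(RF/2) + 1.
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    // Replicas that can be down while QUORUM reads/writes still succeed.
    static int tolerated(int rf) {
        return rf - quorum(rf);
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 5; rf++) {
            System.out.println("RF=" + rf + ": quorum=" + quorum(rf)
                    + ", tolerates " + tolerated(rf) + " replica(s) down");
        }
        // Note RF=2 needs both replicas for QUORUM (no fault tolerance),
        // which is one reason RF=3 is the usual minimum recommendation.
    }
}
```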
RE: Supported Cassandra version for CentOS 5.5
Oh, I should add that I was trying to use Cassandra 2.0.x on CentOS, and it needed CentOS 6.2+.

Don

From: Arindam Barua [mailto:aba...@247-inc.com]
Sent: Wednesday, February 26, 2014 1:52 AM
To: user@cassandra.apache.org
Subject: RE: Supported Cassandra version for CentOS 5.5

I am running Cassandra 1.2.12 on CentOS 5.10. Was running 1.1.15 previously without any issues as well.

-Arindam

From: Donald Smith [mailto:donald.sm...@audiencescience.com]
Sent: Tuesday, February 25, 2014 3:40 PM
To: user@cassandra.apache.org
Subject: RE: Supported Cassandra version for CentOS 5.5

I was unable to get Cassandra working with CentOS 5.x. I needed to use CentOS 6.2 or 6.4.

Don

From: Hari Rajendhran <hari.rajendh...@tcs.com>
Sent: Tuesday, February 25, 2014 2:34 AM
To: user@cassandra.apache.org
Subject: Supported Cassandra version for CentOS 5.5

Hi,

Currently I am using CentOS 5.5. I need a clarification on the latest Cassandra version (preferably 2.0.4) that my OS supports.

Best Regards
Hari Krishnan Rajendhran
Hadoop Admin
DESS-ABIM, Chennai BIGDATA Galaxy
Tata Consultancy Services
Cell: 9677985515
Mailto: hari.rajendh...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services / Business Solutions / Consulting

=-=-= Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
RE: Supported Cassandra version for CentOS 5.5
I was unable to get Cassandra working with CentOS 5.x. I needed to use CentOS 6.2 or 6.4.

Don

From: Hari Rajendhran <hari.rajendh...@tcs.com>
Sent: Tuesday, February 25, 2014 2:34 AM
To: user@cassandra.apache.org
Subject: Supported Cassandra version for CentOS 5.5

Hi,

Currently I am using CentOS 5.5. I need a clarification on the latest Cassandra version (preferably 2.0.4) that my OS supports.

Best Regards
Hari Krishnan Rajendhran
Hadoop Admin
DESS-ABIM, Chennai BIGDATA Galaxy
Tata Consultancy Services
Cell: 9677985515
Mailto: hari.rajendh...@tcs.com
Website: http://www.tcs.com
Experience certainty. IT Services / Business Solutions / Consulting
Corrupted Index File exceptions in 2.0.5
We're getting exceptions like the one below using Cassandra 2.0.5. A Google search turns up nothing about these except the source code. Anyone have any insight?

ERROR [CompactionExecutor:188] 2014-02-12 04:15:53,232 CassandraDaemon.java (line 192) Exception in thread Thread[CompactionExecutor:188,1,main]
org.apache.cassandra.io.sstable.CorruptSSTableException: java.io.IOException: Corrupted Index File -jb-48064-CompressionInfo.db: read 20252 but expected 20253 chunks.

Thanks, Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com
RE: Dangers of sudo swapoff --all
I meant to say: doing "sudo swapon -a" on that node fixed the problem.

From: Donald Smith [mailto:donald.sm...@audiencescience.com]
Sent: Thursday, February 13, 2014 2:57 PM
To: 'user@cassandra.apache.org'
Subject: Dangers of sudo swapoff --all

I followed the recommendations at http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/install/installRecommendSettings.html and did:

$ sudo swapoff --all

on each of the Cassandra servers in my test cluster. I noticed, though, that sometimes the Cassandra server and other processes on one of the nodes suddenly crashed, with no messages indicating why. It turns out that on that node there wasn't much memory, and I was running other processes, so when the OS detected that there was insufficient memory for an operation, it unceremoniously killed some processes. Doing "sudo swapo -a" fixed the problem. This happened on both CentOS 6.2 and CentOS 6.4.

So, if you do "sudo swapoff --all", make sure you're not going to run out of memory!

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com
Warning about copying and pasting from datastax configuration page: weird characters in config
In http://www.datastax.com/documentation/cassandra/2.0/mobile/cassandra/install/installRecommendSettings.html it says:

Packaged installs: Ensure that the following settings are included in the /etc/security/limits.d/cassandra.conf file:

cassandra - memlock unlimited
cassandra - nofile 10
cassandra - nproc 32768
cassandra - as unlimited

But when I copy and paste those four lines to Linux, it inserts periods in the first two lines so they look like this:

cassandra - memlock.unlimited
cassandra - nofile.10
cassandra - nproc 32768
cassandra - as unlimited

This happens in both Firefox and Chrome. And it happened for my coworker too (though for him the spaces after "memlock" and "nofile" were deleted). If I paste on Windows it doesn't happen.

Using Firebug, I found the HTML source:

<pre class="pre">cassandra·-·memlock&ensp;unlimited¶cassandra·-·nofile&ensp;10¶cassandra·-·nproc·32768¶cassandra·-·as·unlimited¶</pre>

The HTML on that page seems fragile. According to http://www.w3.org/TR/html4/sgml/entities.html:

<!ENTITY ensp CDATA "&#8194;" -- en space, U+2002 ISOpub -->

There were other spurious characters included in some config I pasted from there, and they caused headaches. Specifically, earlier I saw:

cassandraâ- memlock unlimited
cassandraâ- nofile 10
cassandra - nproc 32768
cassandra - as unlimited

(with the weird â chars added).

Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com
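A defensive workaround (a hypothetical sketch, not something from the thread) is to normalize pasted config lines before saving them, mapping Unicode space characters such as U+2002 (the en space that the page's &ensp; entities produce) back to plain ASCII spaces:

```java
// Hypothetical cleanup sketch: replace Unicode space characters (no-break
// space, the U+2000-U+200B family including U+2002 EN SPACE, and narrow
// no-break space) with plain ASCII spaces, so limits.conf parses correctly.
public class FixPastedConfig {
    static String normalize(String line) {
        return line.replaceAll("[\\u00A0\\u2000-\\u200B\\u202F]", " ");
    }

    public static void main(String[] args) {
        // U+2002 is the character the pasted page inserted after "memlock".
        String pasted = "cassandra - memlock\u2002unlimited";
        System.out.println(normalize(pasted));
    }
}
```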
RE: Warning about copying and pasting from datastax configuration page: weird characters in config
The same problem happens with the non-mobile page: http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/install/installRecommendSettings.html I used the mobile page because someone embedded that link in a wiki page I was referring to. (I'll change the link.) Don -Original Message- From: Michael Shuler [mailto:mshu...@pbandjelly.org] On Behalf Of Michael Shuler Sent: Tuesday, February 11, 2014 2:58 PM To: user@cassandra.apache.org Subject: Re: Warning about copying and pasting from datastax configuration page: weird characters in config On 02/11/2014 04:50 PM, Donald Smith wrote: In http://www.datastax.com/documentation/cassandra/2.0/mobile/cassandra/install/installRecommendSettings.html it says: Just curious.. why are you using the mobile site on a desktop, instead of the main page? [0] -- Michael [0] http://www.datastax.com/documentation/cassandra/2.0/webhelp/index.html#cassandra/install/installRecommendSettings.html
RE: Question about local reads with multiple data centers
I found the answer. By default, the DataStax driver for Cassandra uses the RoundRobinPolicy to decide which Cassandra node a client read or write request should be routed to. But that policy is independent of data center. Per the documentation (http://www.datastax.com/drivers/java/2.0/apidocs/com/datastax/driver/core/policies/LoadBalancingPolicy.html), if you have multiple data centers it's probably better to use DCAwareRoundRobinPolicy, which gives preference to the local data center. The client program needs to know which data center it resides in (e.g., DC1).

private void connect() {
    if (m_session != null) {
        return;
    }
    String[] components = m_cassandraNode.split(",");
    Builder builder = Cluster.builder();
    for (String component : components) {
        builder.addContactPoint(component);
    }
    long start = System.currentTimeMillis();
    LoadBalancingPolicy loadBalancingPolicy = new DCAwareRoundRobinPolicy(localDataCenterName);
    if (useTokenAwarePolicy) {
        loadBalancingPolicy = new TokenAwarePolicy(loadBalancingPolicy);
    }
    m_cluster = builder.withLoadBalancingPolicy(loadBalancingPolicy).build();
    m_session = m_cluster.connect();
    prepareQueries();
    float seconds = 0.001f * (System.currentTimeMillis() - start);
    System.out.println("Connected to cassandra host " + m_cassandraNode + " in " + seconds + " seconds.");
}

-----Original Message-----
From: Duncan Sands [mailto:duncan.sa...@gmail.com]
Sent: Thursday, January 30, 2014 1:19 AM
To: user@cassandra.apache.org
Subject: Re: Question about local reads with multiple data centers

Hi Donald, which driver are you using? With the DataStax Python driver you need to use DCAwareRoundRobinPolicy as the load balancing policy if you want the driver to distinguish between your data centres; otherwise, by default it round-robins requests amongst all nodes, regardless of which data centre they are in, and regardless of which data centre the nodes you told it to connect to are in. Probably it is the same for the other DataStax drivers.
Best wishes, Duncan.

On 30/01/14 02:07, Donald Smith wrote:

We have two datacenters, DC1 and DC2, in our test cluster. Our write process uses a connection string with just the two hosts in DC1. Our read process uses a connection string with just the two hosts in DC2. We use a PropertyFileSnitch, with replication of 'DC1':2, 'DC2':1 between data centers.

I notice from the read process's logs that the reader adds ALL the hosts (in both datacenters) to the list of queried hosts. My question: will the read process try to read first locally from the datacenter DC2 I specified in its connection string? I presume so. (I doubt that it uses the client's IP address to decide which datacenter is closer, and I am unaware of another way to tell it to read locally.)

Also, will read repair happen between datacenters automatically (read_repair_chance=0.10)? Or does that only happen within a single data center? We're using Cassandra 2.0.4 and CQL.

Thank you

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com
Question about local reads with multiple data centers
We have two datacenters, DC1 and DC2, in our test cluster. Our write process uses a connection string with just the two hosts in DC1. Our read process uses a connection string with just the two hosts in DC2. We use a PropertyFileSnitch, with replication of 'DC1':2, 'DC2':1 between data centers.

I notice from the read process's logs that the reader adds ALL the hosts (in both datacenters) to the list of queried hosts. My question: will the read process try to read first locally from the datacenter DC2 I specified in its connection string? I presume so. (I doubt that it uses the client's IP address to decide which datacenter is closer, and I am unaware of another way to tell it to read locally.)

Also, will read repair happen between datacenters automatically (read_repair_chance=0.10)? Or does that only happen within a single data center? We're using Cassandra 2.0.4 and CQL.

Thank you

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com
RE: No deletes - is periodic repair needed? I think not...
Last week I made a feature request to Apache Cassandra along these lines: https://issues.apache.org/jira/browse/CASSANDRA-6611

Don

From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: Monday, January 27, 2014 4:05 PM
To: user@cassandra.apache.org
Subject: Re: No deletes - is periodic repair needed? I think not...

If you have only TTL columns, and you never update a column, I would not think you need a repair. Repair cures lost deletes. If all your writes have a TTL, a lost write should not matter, since the column was never written to the node and thus could never be resurrected on said node. Unless I am missing something.

On Monday, January 27, 2014, Laing, Michael <michael.la...@nytimes.com> wrote:

Thanks Sylvain,

Your assumption is correct! So I think I actually have 4 classes:

1. Regular values, no deletes, no overwrites, write heavy, variable TTLs to manage size
2. Regular values, no deletes, some overwrites, read heavy (10 to 1), fixed TTLs to manage size
2.a. Regular values, no deletes, some overwrites, read heavy (10 to 1), variable TTLs to manage size
3. Counter values, no deletes, update heavy, rotation/truncation to manage size

Only 2.a. above requires me to do 'periodic repair'. What I will actually do is change my schema and applications slightly to eliminate the need for overwrites on the only table I have in that category. And I will set gc_grace_period to 0 for the tables in the updated schema and drop 'periodic repair' from the schedule.

Cheers, Michael

On Mon, Jan 27, 2014 at 4:22 AM, Sylvain Lebresne <sylv...@datastax.com> wrote:

By 'periodic repair', I'll assume you mean having to run repair every gc_grace period to make sure no deleted entries resurrect. With that assumption:

1. Regular values, no deletes, no overwrites, write heavy, TTLs to manage size

Since 'repair within gc_grace' is about preventing values that have been deleted from resurrecting, if you do no deletes or overwrites, you run no risk of that (and don't need to 'repair within gc_grace').

2. Regular values, no deletes, some overwrites, read heavy (10 to 1), TTLs to manage size

It depends a bit. In general, if you always set the exact same TTL on every insert (implying you always set a TTL), then you have nothing to worry about. If the TTL varies (or if you only set a TTL some of the time), then you might still need some periodic repairs. That being said, if there are no deletes but only TTLs, the TTL lengthens the period at which you need to repair: instead of needing to repair within gc_grace, you only need to repair every gc_grace + min(TTL) (where min(TTL) is the smallest TTL you set on columns).

3. Counter values, no deletes, update heavy, rotation/truncation to manage size

No deletes and no TTL implies that you're fine (as in, there is no need for 'repair within gc_grace').

--
Sylvain

--
Sorry, this was sent from mobile. Will do less grammar and spell check than usual.
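Michael's plan for the TTL-only tables can be sketched in CQL (table and column names are assumed; the thread gives no schema). Note that setting gc_grace_seconds to 0 is safe here only under the conditions discussed above: no deletes and a uniform TTL on every insert:

```sql
-- Sketch: a TTL-only table where every write carries the same TTL and rows
-- are never deleted or overwritten, so periodic repair can be dropped.
CREATE TABLE events (
    id   text PRIMARY KEY,
    body text
) WITH default_time_to_live = 86400   -- one day; applied to every insert
  AND gc_grace_seconds = 0;           -- safe only because there are no deletes

-- A per-statement TTL can also be set explicitly:
INSERT INTO events (id, body) VALUES ('k1', 'v1') USING TTL 86400;
```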
Benchmarks of Cassandra replication across data centers?
Does anyone know of any good benchmark data about Cassandra replication across data centers? I'm aware of the articles below.

This article from Netflix, http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html, is about benchmarking Cassandra scalability using AWS. It shows linear scaling up to 288 nodes. I don't see much data about the cost of cross-data-center replication.

This compares AWS, Rackspace and Google cloud performance for Cassandra: http://www.stackdriver.com/cassandra-aws-gce-rackspace/. AWS does well.

These Netflix slides say they run a nightly repair job to make sure everything stays consistent: http://www.slideshare.net/adrianco/cassandra-performance-on-aws. They also talk about backing up Cassandra. Data size reached only 5GB per node (tiny!). Slide 35 says there's 100+ ms latency between US and EU datacenters. Slide 36 shows how to add a new datacenter with no downtime (pre-load from a backup), then run repair jobs.

http://www.odbms.org/blog/2011/05/measuring-the-scalability-of-sql-and-nosql-systems/ compares Cassandra with three other systems. Cassandra performed well. They open-sourced the benchmark code; it is available here: https://github.com/brianfrankcooper/YCSB. Complex.

Thanks, Don

Donald A. Smith | Senior Software Engineer
P: 425.201.3900 x 3866  C: (206) 819-5965  F: (646) 443-2333
dona...@audiencescience.com