Re: Facebook messaging and choice of HBase over Cassandra - what can we learn?
On Sun, Nov 21, 2010 at 12:10 PM, André Fiedler fiedler.an...@googlemail.com wrote:

Facebook Messaging – HBase Comes of Age
http://facility9.com/2010/11/18/facebook-messaging-hbase-comes-of-age

2010/11/21 David Boxenhorn da...@lookin2.com

Eventual consistency is not good enough for instant messaging.

On Sun, Nov 21, 2010 at 6:32 PM, Simon Reavely simon.reav...@gmail.com wrote:

(Posting this to both user + dev lists)

I was reviewing the blog post on the Facebook engineering blog from Nov 15th:
http://www.facebook.com/note.php?note_id=454991608919#
The Underlying Technology of Messages, by Kannan Muthukkaruppan http://www.facebook.com/Kannan

As a Cassandra user I think the key sentence for this community is: "We found Cassandra's eventual consistency model to be a difficult pattern to reconcile for our new Messages infrastructure."

I think it would be useful to find out more about this statement from Kannan and the Facebook team. Does anyone have any contacts in the Facebook team?

My goal here is to understand usage patterns and whether or not the Cassandra community can learn from this decision; maybe even understand whether the Cassandra roadmap should be influenced by this decision to address a target user base. Of course we might also conclude that it's just not a Cassandra use case!

Cheers, Simon
-- Simon Reavely simon.reav...@gmail.com

Jonathan Ellis pointed out a term that I like using better: "tunable consistency". It seems that "eventual consistency" confuses everyone; that, or it is an easy target for an anti-Cassandra public relations campaign.

If you want consistency use one of:
WRITE.ALL + READ.ONE (hinted handoff off)
WRITE.QUORUM + READ.QUORUM
WRITE.ONE + READ.ALL

Also, I believe saying HBase is consistent is not true. This can happen: write to region server -> region server acknowledges client -> write to WAL -> region server fails = write lost. I wonder how Facebook will reconcile that. :)

Not trying to be nitpicky; at Hadoop World in NYC I got to sit with lots of the HBase guys and we all had a great time talking about the mutual issues and happiness both of our communities share.
We cannot speak for Facebook, but they likely chose HBase because they have several of the Hadoop core developers and a large Hadoop deployment. I would say the decision was probably based on several things. The current Cassandra release does not do online schema updates, and I am sure Facebook does not want to restart 10,000 Cassandra servers for a schema change. The current release also does not have memtable tuning per column family. The upcoming Cassandra release has support for both of these things and many, many more awesome things.

Facebook is on the high end of how much data they have to manage and how many servers they have. Most people do not share that use case. What we can learn is that Facebook chose software that was good for them, based on their use case and the experience they have in house. That is something everyone should do.
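The three read/write combinations listed above share one property: the number of replicas written plus the number read exceeds the replication factor, so every read overlaps at least one replica that acknowledged the write. The following is a minimal sketch of that arithmetic only, assuming a replication factor of 3; the function and variable names are illustrative and are not part of any Cassandra API.

```python
def overlaps(n: int, w: int, r: int) -> bool:
    """True when every read set must intersect every acknowledged write set."""
    return r + w > n

N = 3                # assumed replication factor
QUORUM = N // 2 + 1  # 2 of 3

combos = {
    "WRITE.ALL + READ.ONE": (N, 1),
    "WRITE.QUORUM + READ.QUORUM": (QUORUM, QUORUM),
    "WRITE.ONE + READ.ALL": (1, N),
    "WRITE.ONE + READ.ONE": (1, 1),  # added for contrast, not in the list above
}

results = {name: overlaps(N, w, r) for name, (w, r) in combos.items()}
for name, ok in results.items():
    print(f"{name}: {'overlapping' if ok else 'may read stale data'}")
```

The last combination is included only to show a pairing that does not overlap.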
Re: Cassandra memtable and GC
On Mon, Nov 22, 2010 at 8:28 AM, Shotaro Kamio kamios...@gmail.com wrote:

Hi Peter,

I've tested again while recording LiveSSTableCount and MemtableDataSize via JMX. I guess this result supports my suspicion about memtable performance, because I cannot find a Full GC this time. This result is for a smaller data size (160 million records on Cassandra) on a different disk configuration from my previous post, but the general picture doesn't change.

The attached files:
- graph-read-throughput-diskT.png: read throughput on my client program.
- graph-diskT-stat-with-jmx.png: graph of cpu load, LiveSSTableCount and logarithm of MemtableDataSize.
- log-gc.20101122-12:41.160M.log.gz: GC log with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

As you can see from the second graph, the logarithm of MemtableDataSize and cpu load have a clear correlation. When a memtable is flushed and a new SSTable is created (LiveSSTableCount is incremented), read performance recovers, but it degrades again soon. I couldn't find a Full GC in the GC log in this test, so I guess that this performance is not a result of GC activity.

Regards, Shotaro

On Sat, Nov 20, 2010 at 6:37 PM, Peter Schuller peter.schul...@infidyne.com wrote:

After a memtable flush, you see minimum cpu and maximum read throughput, both in terms of disk and Cassandra records read. As the memtable increases in size, cpu goes up and reads drop. Whether this is because of a memtable or a GC performance issue is the big question. As each memtable is just 128MB when flushed, I don't really expect GC problems or caching issues.

A memtable is basically just a ConcurrentSkipListMap. Unless you are somehow triggering some kind of degenerate case in the CSLM itself, which seems unlikely, the only common circumstance where filling the memtable should result in a very significant performance drop is if you're running really close to heap size and causing additional GC asymptotically as you're growing the memtable. But that doesn't seem to be the case.

I don't know, maybe I missed something in your original post, but I'm not sure what to suggest that I haven't already without further information/hands-on experimentation/observation. But running with verbose GC as I mentioned should at least be a good start (-Xloggc:path/to/gclog -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps).

-- / Peter Schuller

-- Shotaro Kamio

As you can see from the second graph, the logarithm of MemtableDataSize and cpu load have a clear correlation.

This makes sense.

You'll see the disk read throughput is periodically going down and up. At 17:45:00, it shows zero disk read/sec.

-- This must mean that your load is being completely served from cache. If you have a very high cache hit rate, CPU and memory are the ONLY factors. If CPU and memtables are the only factors, then larger memtables will start to perform slower than smaller memtables. Possibly with SSDs the conventional thinking on larger SSTables does not apply (at least for your active set).
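Checking a GC log for Full GC events, as discussed above, can be done mechanically. Here is a rough sketch, assuming the classic HotSpot log layout produced by -XX:+PrintGCDetails with -XX:+PrintGCTimeStamps, where each event line starts with an uptime in seconds. The two sample lines are made up for illustration and are not taken from the poster's log.

```python
import re

# Matches lines such as "95.004: [Full GC ..." and captures the uptime in seconds.
FULL_GC = re.compile(r"^(?P<ts>\d+\.\d+): \[Full GC")

def full_gc_timestamps(lines):
    """Return the JVM uptime (seconds) of every Full GC line found."""
    found = []
    for line in lines:
        match = FULL_GC.match(line)
        if match:
            found.append(float(match.group("ts")))
    return found

# Hypothetical sample lines, for illustration only.
sample = [
    "10.512: [GC 10.512: [ParNew: 17024K->2112K(19136K), 0.0123 secs] 17024K->5210K(83008K), 0.0124 secs]",
    "95.004: [Full GC 95.004: [CMS: 60000K->41000K(63872K), 0.5210 secs] 70000K->41000K(83008K), 0.5212 secs]",
]

hits = full_gc_timestamps(sample)
print(hits)
```

An empty result over the whole log would match the observation that no Full GC occurred during the test.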
Re: cassandra vs hbase summary (was facebook messaging)
On Mon, Nov 22, 2010 at 2:52 PM, Todd Lipcon t...@lipcon.org wrote:

On Mon, Nov 22, 2010 at 10:01 AM, David Jeske dav...@gmail.com wrote:

I haven't used either Cassandra or HBase, so please don't take any part of this message as me attempting to state facts about either system. However, I'm very familiar with data-storage design details, and I've worked extensively optimizing applications running on MySQL, Oracle, BerkeleyDB (including distributed txn BerkeleyDB), and Google Bigtable.

The recent discussion triggered by Facebook messaging using HBase helped surface many interesting design differences in the two systems. I'm writing this message both to summarize what I've read in a few different places about that topic, and to check my facts.

As far as I can discern, this is a decent summary of the consistency and performance differences between HBase and Cassandra (N3/R2/W2 or N3/R1/W3) for an HBase-acceptable workload. (Please correct the facts if they appear wrong!)

1) Cassandra can't replicate the consistency situation of HBase. Namely, that when a write requiring a quorum fails it will never appear. Deriving from this explanation:

[In Cassandra] Provided at least one node receives the write, it will eventually be written to all replicas. A failure to meet the requested ConsistencyLevel is just that; not a failure to write the data itself. Once the write is received by a node, it will eventually reach all replicas, there is no roll back. - Nick Telford [ref]

[In HBase] The DFSClient call returns when all datanodes in the pipeline have flushed (to the OS buffer) and ack'ed. That code comes from HDFS-200 in the 0.20-append branch and HDFS-265 for all branches after 0.20, meaning that it's in 0.21.0 - Jean-Daniel Cryans [ref]

In HBase, if a write is accepted by only 1 of 3 HDFS replicas, and the region master never receives a response from the other two replicas, and it fails the client write, that write should never appear. Even if the region master then fails, when a new region master is elected, and it restarts and recovers, it should read HDFS blocks and accept the consensus 2/3 opinion that the log does not contain the write -- dropping the write. The write will never be seen.

Not quite. The replica synchronization code is pretty messy, but basically it will take the longest replica that may have been synced, not a quorum. I.e. the guarantee is that if you successfully sync() data, it will be present after replica synchronization. Unsynced data *may* be present after replica synchronization.

But keep in mind that recovery is blocking in most cases - i.e. if the RS is writing to a pipeline and waiting on acks, and one of the nodes in the pipeline dies, then it will recover the pipeline (without the dead node) and continue syncing to the remaining two nodes. The client is still blocked at this point.

If the RS itself dies, then it won't respond to the client at all, and it's anyone's guess whether the write was successful or not. The same is true if the network between client and RS dies. This is unavoidable in any system - a server can always fail *just before* sending the success message, and the write is left in a "maybe written" state.

What will *not* happen, though, is the following case:
- Row contains value A
- Client writes value B
- RS fails
- Client reads value A
- Client reads again and sees value B

Similarly, if the client reads value B, it won't revert to value A in any circumstance.

In Cassandra, if a write (requesting 2 or 3 copies) is accepted by only one node, that write will fail to the client. Future reads with R=1 will see that write or not depending on whether they contact the one server that accepted it, until the data is propagated, at which time they will see the write. Reads with R=2 will not see the write until it is propagated to at least two servers. There is no mechanism to assure that a write is either accepted by the requested number of servers or aborted.

2) Cassandra has a less efficient memory footprint for data pinned in memory (or cached). With 3 replicas on Cassandra, each element of data pinned in memory is kept in memory on 3 servers, whereas in HBase only region masters keep the data in memory, so there is only one copy of each data element.

3) Cassandra (N3/W2/R2) has slower reads of cached or pinned-in-memory data. HBase can answer a read-only query that is in memory from the single region master, while Cassandra (N3/W2/R2) must read from multiple servers. (Note, N3/W2/R2 still doesn't produce the same consistency situation as HBase, see #1.)

Yes, probably - except that it seems to me Cassandra should be able to offer lower latency in the face of Java GC pauses. If an HBase RS is in a 200ms GC pause, latency for all rows hosted by that server will spike to 200ms. If one of three replicas is in a 200ms GC pause, the other two replicas will still respond quickly so latency should be less spiky in
Re: cassandra vs hbase summary (was facebook messaging)
On Mon, Nov 22, 2010 at 5:14 PM, Todd Lipcon t...@lipcon.org wrote:

On Mon, Nov 22, 2010 at 1:58 PM, David Jeske dav...@gmail.com wrote:

On Mon, Nov 22, 2010 at 11:52 AM, Todd Lipcon t...@lipcon.org wrote:

Not quite. The replica synchronization code is pretty messy, but basically it will take the longest replica that may have been synced, not a quorum. I.e. the guarantee is that if you successfully sync() data, it will be present after replica synchronization. Unsynced data *may* be present after replica synchronization. But keep in mind that recovery is blocking in most cases - i.e. if the RS is writing to a pipeline and waiting on acks, and one of the nodes in the pipeline dies, then it will recover the pipeline (without the dead node) and continue syncing to the remaining two nodes. The client is still blocked at this point.

I see. So it sounds like my statement #1 was wrong. Will the RS ever time out the write and fail in the face of not being able to push it to HDFS?

Is it correct to say: once a write is issued to HBase, it will either catastrophically fail (i.e. disconnect), in which case the write will either have failed or succeeded, and if it succeeded, future reads will always show that write? As opposed to Cassandra, which, in all configurations where reads allow a subset of all nodes, can fail a write while having the write show a temporary period of inconsistency (depending on who you talk to), followed by the write either applying or not applying depending on whether or not it actually wrote to a single node during the failure to meet the write consistency request?

Yes, this seems accurate to me.

Does Cassandra have any return result which distinguishes between these two states:
1 - your data was not written to any nodes (true failure)
2 - your data was written to at least 1 node, but not enough to meet your write-consistency count?

David,

Return messages such as "your data was written to at least 1 node but not enough to make your write-consistency count" do not help the situation, as the client that writes the data would be aware of the inconsistency, but the other clients would not. Thus it only makes sense to pass or fail entirely. (Though it could be an interesting error message.)

Right, CASSANDRA-1314 only solves the memory overhead issue.

Another twist to throw into the losing-writes conversation is that file systems can lose writes as well :) unless you are choosing many synchronous options that most people do not use (IMHO).

@Todd. Good catch about caching HFile blocks. My point still applies though. Caching HFile blocks on a single node vs individual datums on N nodes may not be more efficient. Thus terms like "slower" and "less efficient" could be very misleading. Isn't caching only the item more efficient? In cases with high random read, is evicting single keys more efficient than evicting blocks in terms of memory churn? These are difficult questions to answer absolutely, so bullet points such as "Cassandra has slower this" are oversimplifications of complex problems.
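The "failed write that may still show up" behaviour discussed in this exchange can be illustrated with a toy model. This is only a sketch under simplifying assumptions: three in-memory replicas, a read that returns the newest timestamp among the replicas it contacts, and a propagation step that stands in for whatever background mechanism eventually copies the write. None of the names below are Cassandra APIs.

```python
# Three replicas, all starting with value "A" at timestamp 1.
replicas = [{"value": "A", "ts": 1} for _ in range(3)]

def write(value, ts, reachable, w):
    """Apply the write to every reachable replica; report success only if w acked.
    There is no rollback when fewer than w replicas were reached."""
    for i in reachable:
        replicas[i] = {"value": value, "ts": ts}
    return len(reachable) >= w

def read(contacted):
    """Return the newest value among the contacted replicas."""
    return max((replicas[i] for i in contacted), key=lambda cell: cell["ts"])["value"]

def propagate():
    """Stand-in for eventual propagation: copy the newest cell everywhere."""
    newest = max(replicas, key=lambda cell: cell["ts"])
    for i in range(len(replicas)):
        replicas[i] = dict(newest)

# Only replica 0 is reachable, but the client asked for W=2.
write_ok = write("B", ts=2, reachable=[0], w=2)
before = [read([i]) for i in range(3)]   # R=1 against each replica in turn
propagate()
after = [read([i]) for i in range(3)]

print(write_ok, before, after)
```

In this model the client is told the write failed, yet an R=1 read sees either value depending on which replica it reaches, and after propagation every read sees the new value.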
Re: cassandra vs hbase summary (was facebook messaging)
On Mon, Nov 22, 2010 at 5:48 PM, David Jeske dav...@gmail.com wrote:

On Mon, Nov 22, 2010 at 2:44 PM, David Jeske dav...@gmail.com wrote:

On Mon, Nov 22, 2010 at 2:39 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

Return messages such as "your data was written to at least 1 node but not enough to make your write-consistency count" do not help the situation, as the client that writes the data would be aware of the inconsistency, but the other clients would not. Thus it only makes sense to pass or fail entirely. (Though it could be an interesting error message.)

I should have thought about that before I sent it. Let me rephrase.

Doesn't the current return message actually mean "your data was written to between 0 and N nodes, but not enough to make your write-consistency count"? I agree with you that "your data was written to at least 1 node but not enough to make your write-consistency count" is not that useful. However, the current failure seems to merge a real failure (i.e. your data will never show up) with a possible failure (your data might show up).

Personally I'd really like to know if my data was not written at all, and that has a very different meaning than "my data was sort-of-written, but not replicated as widely as I'd like, but it someday might be, or it someday might not".

If you get an UnavailableException or a TimedOutException, your client needs to retry the write.

The point is that Cassandra has tunable consistency. You can say things such as:
- WRITE.ANY: I want to write to any node, even if all replicas for the key are down, so it will get there later.
- WRITE.ALL: I want to write to all nodes and get an exception if it does not pass on all nodes.
- WRITE.ONE: I want to write to one node that is a replica.
- READ.QUORUM + WRITE.QUORUM: I want consistent data.

Also, from your comments above, you are not taking into account that there are two halves of the equation: read and write. If you mix the two levels you can solve many of those concerns.

Cassandra is a distributed system. It is NOT just like HBase. If you are worried about the edge cases associated with node failures, Cassandra may not be for you. See http://en.wikipedia.org/wiki/CAP_theorem. However, as you pointed out in item #5, if you lose a region server you are not going to be able to read or write that data (at all): http://www.mail-archive.com/hbase-u...@hadoop.apache.org/msg09989.html. This poster talks about 3-4 minutes of outage. If you want consistency like HBase you have to live with that outage.
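The advice above to retry on UnavailableException or TimedOutException could look roughly like the wrapper below. It is a sketch, not client-library code: the two exception classes are local stand-ins for whatever your client raises, `do_write` is a hypothetical callable that performs one write attempt, and it assumes that repeating the same write is harmless.

```python
import time

class TimedOutError(Exception):
    """Stand-in for the client library's timed-out exception."""

class UnavailableError(Exception):
    """Stand-in for the client library's unavailable exception."""

def write_with_retry(do_write, attempts=3, delay_seconds=0.0):
    """Call do_write until it succeeds or the attempts are used up."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return do_write()
        except (TimedOutError, UnavailableError) as exc:
            last_error = exc
            time.sleep(delay_seconds)
    raise last_error

# Hypothetical write that times out twice and then succeeds.
calls = 0

def flaky_write():
    global calls
    calls += 1
    if calls < 3:
        raise TimedOutError("replicas did not answer in time")
    return "ok"

result = write_with_retry(flaky_write)
print(result, calls)
```

Any other exception type is deliberately not caught, so real errors still surface on the first attempt.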
Re: monitoring read and write problems via log file?
On Wed, Nov 24, 2010 at 3:04 AM, Peter Schuller peter.schul...@infidyne.com wrote:

I was told by a colleague that read and write problems in Cassandra can be detected by monitoring a Cassandra log file.

What do you mean by problem? If you mean something like a hard I/O error or corruption causing an internal error, you should get an exception of some kind in the system log (typically /var/log/cassandra/output.log or similar, unless otherwise configured).

-- / Peter Schuller

At the default log level of info you should look for:
- DroppedMessageLogger -- backpressure is causing failures
- GCInspector -- garbage collector paused
- optimal bloom filter -- not sure this is critical, but it appears at times
- Large row -- message from compaction about a really large row
- STATE Down -- message from gossip about node flap
- STATE UP -- message from gossip about node flap
- Digest mismatch exception -- quorum read fixed data (I do not see this much)

I use a log4j syslog appender to send info to our splunk/syslog station. I use splunk to count these events based on time buckets.
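Counting these log messages per time bucket does not require splunk; the idea can be sketched in a few lines. The sketch below assumes each log line begins with a `YYYY-MM-DD HH:MM:SS` timestamp, which depends on your log4j layout, and the sample lines are invented for illustration rather than copied from a real Cassandra log.

```python
from collections import Counter

# Substrings to look for, taken from the list above.
PATTERNS = ("DroppedMessageLogger", "GCInspector", "Large row", "Digest mismatch")

def count_events(lines):
    """Count matching lines per (hour bucket, pattern)."""
    counts = Counter()
    for line in lines:
        hour = line[:13]  # "YYYY-MM-DD HH", assuming the timestamp layout above
        for pattern in PATTERNS:
            if pattern in line:
                counts[(hour, pattern)] += 1
    return counts

# Hypothetical log lines, for illustration only.
sample = [
    "2010-11-24 03:04:10 INFO GCInspector GC for ParNew: 412 ms",
    "2010-11-24 03:17:55 INFO GCInspector GC for ParNew: 388 ms",
    "2010-11-24 04:01:02 WARN DroppedMessageLogger dropped 12 messages",
]

counts = count_events(sample)
for (hour, pattern), n in sorted(counts.items()):
    print(hour, pattern, n)
```

A sudden jump in one bucket is the kind of signal the time-bucket counts are meant to expose.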
Re: Capacity problem with a lot of writes?
On Fri, Nov 26, 2010 at 10:49 AM, Peter Schuller peter.schul...@infidyne.com wrote:

Making compaction parallel isn't a priority because the problem is almost always the opposite: how do we spread it out over a longer period of time instead of sharp spikes of activity that hurt read/write latency. I'd be very surprised if latency would be acceptable if you did have parallel compaction. In other words, your real problem is you need more capacity for your workload.

Do you expect this to be true even with the I/O situation improved (i.e., under conditions where the additional I/O is not a problem)? It seems counter-intuitive to me that single-core compaction would make a huge impact on latency when compaction is CPU bound on an 8+ core system under moderate load (even taking into account cache coherency/NUMA etc).

-- / Peter Schuller

Carlos,

I wanted to mention a specific technique I used to solve a situation I ran into. We had a large influx of data that pushed at our current hardware; as stated above, the true answer was more hardware. However, we ran into a situation where a single node failed several large compactions. After failing 2 or 3 big compactions we ended up with ~1000 SSTables for a column family. This turned into a chicken-and-egg situation where reads were slow because there were many SSTables and extra data like tombstones, while the compaction was brutally slow because of the read/write traffic.

My solution was to create a side-by-side install on the same box. I used different data directories and different ports (/var/lib/cassandra/compact, 9168, etc.), moved the data to the new install and started it up. Then I ran nodetool compact on the new instance. This node was seeing no read or write traffic. I was surprised to see the machine was at 400%/1600% CPU used and not much io-wait. Compacting 600 GB of small SSTables took about 4 days. (However, when SSTables are larger I have compacted 400 GB in 4 hours on the same hardware.) After that I moved the data files back in place and started the node back into the cluster.

I have lived on both sides of the fence, where I want long slow compactions or breakneck fast ones. I believe there is room for other compaction models. I am interested in systems that can optimize the case with multiple data directories, for example. It seems from my experiment that a major compaction cannot fully utilize the hardware in specific conditions. Knowing which models to use where, and how to automatically select the optimal strategy, are interesting concerns.
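To put the two compaction figures quoted above side by side, it helps to convert them to throughput. This is plain arithmetic on the numbers in the message (600 GB in about 4 days versus 400 GB in 4 hours), taking 1 GB as 1024 MB; the function name is only for illustration.

```python
def mb_per_second(gigabytes, seconds):
    """Average throughput in MB/s for a compaction of the given size and duration."""
    return gigabytes * 1024 / seconds

small_sstables = mb_per_second(600, 4 * 24 * 3600)  # 600 GB in about 4 days
large_sstables = mb_per_second(400, 4 * 3600)       # 400 GB in 4 hours

print(f"many small SSTables: {small_sstables:.2f} MB/s")
print(f"larger SSTables:     {large_sstables:.2f} MB/s")
print(f"ratio:               {large_sstables / small_sstables:.0f}x")
```

Both rates are far below what sequential disk I/O usually delivers, which fits the observation that the run was not waiting on I/O.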
Re: Using mySQL to emulate Cassandra
On Sun, Nov 28, 2010 at 11:35 AM, Tom Melendez t...@supertom.com wrote:

On Sun, Nov 28, 2010 at 12:28 AM, David Boxenhorn da...@lookin2.com wrote:

As our launch date approaches, I am getting increasingly nervous about Cassandra tuning. It is a mysterious black art that I haven't mastered even at the low usages that we have now. I know of a few more things I can do to improve things, but how will I know if it is enough? All this is particularly ironic since - as we are just starting out - we don't have scalability problems yet, though we hope to!

How are your load tests looking? Of course, there's nothing like going live, but I expect you'll be able to simulate 2x-3x your initial launch traffic.

Luckily, I have completely wrapped Cassandra in an entity mapper, so that I can easily trade in something else, perhaps temporarily, until we really need Cassandra's scalability. So, I'm thinking of emulating Cassandra with MySQL. I would use MySQL either as a simple key-value store, without joins, or map Cassandra supercolumns to MySQL columns, probably of type CLOB. Does anyone want to talk me out of this?

As you said, I think you just have some cold feet. My feeling is that you did some original research and decided on Cassandra for various reasons. I think if you put the MySQL solution in now, you won't go back to the Cassandra solution, because once it's live, it will be much riskier to switch. And if you feel you made a mistake in your original assessment, then great, at least you found out before launch.

Whatever you choose, I would flesh out my fears with as much detail as possible. Invest in load tests and develop contingency plans. I talked about this in 2009 a little bit here - see slide 22, we call these Defcon Levels.
http://www.slideshare.net/supertom/building-configurable-applications-for-the-web
The idea is prioritizing what REALLY is important if the shit hits the fan (watch out, biz folks think everything is always important) and having processes to implement and knobs to turn and levers to pull should you get slashdotted (or facebooked, tweeted, oprahed, techcrunched or whatever we call it these days).

Good luck with your launch.

Thanks, Tom

You should always worry about everything, but you should also have confidence in your decisions. If your worry is how your cluster will perform under load, then you should find a way to test under load. Tweaks and tunes do not make scalability (they help); hardware does. If you want to be ready to be 'slashdotted' you had better have a rack of servers idling.

If you just need a key-value store you may not need Cassandra. Cassandra is scalable in a different way than MySQL would be. You want convincing... (I'll try.) Cassandra shards through node joins and handles replication. If you start off with a MySQL master/slave architecture, or with sharding by hash(key) mod 3, it is not clear how you grow that cluster with demand. If you make a choice that is not scalable, when you get 'slashdotted' you will not be ready. What is worse, you will have no easy way out of the problem.
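The difficulty with growing a hash(key) mod 3 layout can be made concrete by counting how many keys change shard when a fourth server is added. The sketch below uses MD5 purely as a stable hash for illustration, with made-up key names; it is not how either MySQL or Cassandra places data.

```python
import hashlib

def shard(key: str, shard_count: int) -> int:
    """Place a key by hashing it and taking the remainder."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count

# Hypothetical keys, for illustration only.
keys = [f"user{i}" for i in range(10000)]

moved = sum(1 for key in keys if shard(key, 3) != shard(key, 4))
fraction_moved = moved / len(keys)

print(f"{fraction_moved:.0%} of keys land on a different shard after going from 3 to 4")
```

With a uniform hash roughly three quarters of the keys have to be relocated for a single added server, which is the growth problem the message is warning about.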
Re: get_count - cassandra 0.7.x predicate limit bug?
On Tue, Nov 30, 2010 at 1:00 AM, Tyler Hobbs ty...@riptano.com wrote:

What error are you getting? Remember, get_count() is still just about as much work for Cassandra as getting the whole row; the only advantage is it doesn't have to send the whole row back to the client. If you're counting 3+ million columns frequently, it's time to take a look at counters.

- Tyler

On Fri, Nov 26, 2010 at 10:33 AM, Marcin mar...@33concept.com wrote:

Hi guys,

I have a key with 3 million+ columns, but when I try to run get_count on it I get an error if I set the limit to more than 46000+. Any ideas? In the previous API there was no predicate at all, so it was simply counting the number of columns; now it's not so simple any more. Please let me know if that is a bug or if I am doing something wrong.

cheers, /Marcin

+1 Tyler. The problem is that you can increase the client's socket timeout as high as you like: if socketTimeout < rpcTimeout you should see SocketTimeoutExceptions; if socketTimeout >= rpcTimeout you start seeing Cassandra TimedOutExceptions. Raising the RPC timeout is done on the server. In any case, you may have to range_slice to get through a row this big and count.

Also, in my experience rows this large do not work well. They are particularly dangerous when combined with RowCache, as bringing them into memory and evicting them is both disk and memory intensive.
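Counting a very wide row in pages, rather than in one call, can be sketched as below. `get_slice` here is a local stand-in that imitates a slice with an inclusive start column and a limit over a sorted in-memory list; it is not the Thrift call itself, and the column names are invented. Because the start is inclusive, every page after the first repeats one column, which has to be dropped.

```python
def get_slice(row, start, limit):
    """Stand-in for a column slice: up to `limit` column names >= start, in order."""
    return [name for name in row if name >= start][:limit]

def paged_count(row, page_size=1000):
    """Count columns by walking the row one page at a time."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    total = 0
    start = ""
    first_page = True
    while True:
        page = get_slice(row, start, page_size)
        if not first_page:
            page = page[1:]  # the start column was already counted
        if not page:
            return total
        total += len(page)
        start = page[-1]
        first_page = False

# Hypothetical wide row with sortable column names.
wide_row = [f"col{i:07d}" for i in range(2500)]
column_count = paged_count(wide_row, page_size=1000)
print(column_count)
```

Each page stays small enough to return well inside the RPC timeout, at the cost of more round trips.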
Re: how to see how many rows in each node?
On Fri, Dec 3, 2010 at 12:53 PM, Robert Coli rc...@digg.com wrote:

On 12/3/10 6:09 AM, Jonathan Ellis wrote:

Divide space used by average row size from cfstats.

On Fri, Dec 3, 2010 at 7:58 AM, Donal Zang zan...@ihep.ac.cn wrote:

RT. Is there any command or API?

In 0.6.x:

strings /path/to/cassandra/data/Keyspace/*-Index.db | wc -l

=Rob

0.7.0 has an estimate-keys function available somewhere inside JConsole.
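The "space used divided by average row size" estimate suggested above is a one-line calculation. The sketch below assumes you have read both values, in bytes, off the cfstats output yourself; the figures used here are made up.

```python
def estimate_rows(space_used_bytes: int, average_row_size_bytes: int) -> int:
    """Rough row count for one column family on one node."""
    if average_row_size_bytes <= 0:
        raise ValueError("average row size must be positive")
    return space_used_bytes // average_row_size_bytes

# Hypothetical figures: 50 GiB on disk, 2 KiB average row.
rows = estimate_rows(50 * 1024**3, 2 * 1024)
print(rows)
```

The result is only an estimate, since space used also includes data that compaction has not yet merged or discarded.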
Running multiple instances on a single server --micrandra ??
I am quite ready to be stoned for this thread, but I have been thinking about this for a while and I just wanted to bounce these ideas off some gurus.

Cassandra does allow multiple data directories, but as far as I can tell no one runs in this configuration. This is something that is very different between the HBase architecture and the Cassandra architecture. HBase borrows the concept of JBOD configurations from Hadoop. HBase has many smallish (~256 MB) regions managed with ZooKeeper. Cassandra has a few (1 per node) large node-sized token ranges managed by gossip consensus.

Let's say a node has 6 300 GB disks. You have the options of RAID5, RAID6, RAID10, or RAID0. The problem I have found with these configurations is that major compactions (or even large minor ones) can take a long time. Even if your disk is not heavily utilized, this is a lot of data to move through. Thus node joins take a long time. Node moves take a long time.

The idea behind micrandra is, for a 6 disk system, to run 6 instances of Cassandra, one per disk. Use the RackAwareSnitch to make sure no replicas live on the same node.

The downsides:
1) We would have to manage 6x the instances of Cassandra.
2) We would have some overhead for each JVM.

The upsides?
1) A disk/instance failure only degrades the overall performance by 1/6th (with RAID0 you lose the entire node; RAID5 still takes a hit when down a disk).
2) Moves and joins have less work to do.
3) Can scale up a single node by adding a single disk to an existing system (assuming the RAM and CPU load is light).
4) OPP would be easier to balance out hot spots (maybe not on this one, I am not on OPP).

What does everyone think? Does it ever make sense to run this way?
Re: Running multiple instances on a single server --micrandra ??
On Thu, Dec 9, 2010 at 10:40 PM, Bill de hÓra b...@dehora.net wrote:

On Tue, 2010-12-07 at 21:25 -0500, Edward Capriolo wrote: The idea behind micrandra is, for a 6 disk system, to run 6 instances of Cassandra, one per disk. Use the RackAwareSnitch to make sure no replicas live on the same node. The downsides: 1) we would have to manage 6x the instances of Cassandra 2) we would have some overhead for each JVM. The upsides? 1) A disk/instance failure only degrades overall performance by 1/6th (with RAID0 you lose the entire node; RAID5 still takes a hit when down a disk) 2) Moves and joins have less work to do 3) Can scale up a single node by adding a single disk to an existing system (assuming the RAM and CPU load is light) 4) OPP would be easier to balance out hot spots (maybe not on this one in not an OPP) What does everyone think? Does it ever make sense to run this way?

It might for read heavy loads. When I looked at this, it was pointed out to me that it's simpler to run fewer, bigger, coarser nodes and take the entire node/server out when something goes wrong. Basically give each Cassandra a server. I wonder if it would be better to rethink compaction if that's what's driving the idea. It seems to be what is biting everyone, along with GC.

Bill

Having 6 IPs on a machine would be a given in this setup. That is not an issue for me. It is not biting me.

We all know that going from 10 to 20 nodes is pretty simple. However, organic growth from 10 to 16, then a couple of months later from 16 to 22, can take some effort with 300-600 GB per node, since each join and cleanup can take a while. I am wondering if dividing a single large node into multiple smaller instances would make this type of growth easier.
Re: Running multiple instances on a single server --micrandra ??
On Fri, Dec 10, 2010 at 11:39 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Having 6 IPs on a machine would be a given in this setup. That is not an issue for me. It is not biting me. We all know that going from 10 to 20 nodes is pretty simple. However, organic growth from 10 to 16, then a couple of months later from 16 to 22, can take some effort with 300-600 GB per node, since each join and cleanup can take a while. I am wondering if dividing a single large node into multiple smaller instances would make this type of growth easier.

To clearly explain the scenario: a 5 node cluster where each node has 20% of the ring. They each have 6 disks and ~200 GB of data. Going to 10 nodes is easy. You can join each new node directly between two existing nodes.

However, if you are going from, say, 5 to 8, this gets dicey. Do you calculate the ideal ring position for 10 nodes?

20% | 20% | 10% | 10% | 10% | 10% | 10% | 10%

This results in three joins and several cleanups. With this choice you save time, but you hope you do not get to the point where the first two nodes get overloaded. If you decide to work with the ideal tokens for 8, you have many moves and joins.

Until we have:
https://issues.apache.org/jira/browse/CASSANDRA-1418
https://issues.apache.org/jira/browse/CASSANDRA-1427
having 6 smaller instances on a node with 6 disks would make it easier to keep close to balanced without having to double your cluster size each time you grow, or doing a series of moves to get balanced again.
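The 5 to 8 node arithmetic above can be checked with a small sketch. It assumes RandomPartitioner's token space of 0 to 2**127 and that a node owns the range from the previous token up to its own; the function names are made up for illustration and are not part of any Cassandra tool.

```python
RING = 2**127  # size of the RandomPartitioner token space


def ideal_tokens(n):
    """Evenly spaced tokens for an n-node ring."""
    return [i * RING // n for i in range(n)]


def ownership(tokens):
    """Fraction of the ring owned by each token, in sorted token order."""
    ring = sorted(tokens)
    if len(ring) == 1:
        return [1.0]
    # ring[i - 1] wraps to the last token when i == 0.
    return [((cur - ring[i - 1]) % RING) / RING for i, cur in enumerate(ring)]


def grow_by_bisecting(tokens, joins):
    """Join `joins` new nodes, each at the midpoint of one existing range.

    Existing nodes keep their tokens, so no moves are needed.
    """
    ring = sorted(tokens)
    if not 0 <= joins < len(ring):
        raise ValueError("this sketch only bisects fewer ranges than there are nodes")
    new = [(ring[i - 1] + ring[i]) // 2 for i in range(len(ring) - joins, len(ring))]
    return sorted(ring + new)


if __name__ == "__main__":
    grown = grow_by_bisecting(ideal_tokens(5), 3)
    print([round(share, 2) for share in ownership(grown)])
    # Tokens shared between the balanced 5-node and 8-node layouts:
    print(len(set(ideal_tokens(5)) & set(ideal_tokens(8))))
```

Bisecting three ranges gives the 20/20/10/10/10/10/10/10 split from the post with three joins and no moves. The balanced 8-node layout shares only one token with the balanced 5-node layout, which is why that route needs four moves on top of the three joins.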
Re: N to N relationships
On Sun, Dec 12, 2010 at 3:20 AM, David Boxenhorn da...@lookin2.com wrote: You want to store every value twice? That would be a pain to maintain, and possibly lead to inconsistent data.

On Fri, Dec 10, 2010 at 3:50 AM, Nick Bailey n...@riptano.com wrote: I would also recommend two column families. Storing the key as NxN would require you to hit multiple machines to query for an entire row or column with RandomPartitioner. Even with OPP you would need to pick rows or columns to order by, and the other would require hitting multiple machines. Two column families avoids this and avoids any problems with choosing OPP.

On Thu, Dec 9, 2010 at 2:26 PM, Aaron Morton aa...@thelastpickle.com wrote: Am assuming you have one matrix and you know the dimensions. Also, as you say, the most important queries are to get an entire column or an entire row. I would consider using a standard CF for the Columns and one for the Rows. The key for each would be the col / row number, each cassandra column name would be the id of the other dimension and the value whatever you want.
- when storing the data, update both the Column and Row CF
- reading a whole row/col would be simply reading from the appropriate CF.
- reading an intersection is a get_slice to either col or row CF using the column_names field to identify the other dimension.
You would not need secondary indexes to serve these queries. Hope that helps. Aaron

On 10 Dec, 2010, at 07:02 AM, Sébastien Druon sdr...@spotuse.com wrote: I mean if I have secondary indexes. Apparently they are calculated in the background...

On 9 December 2010 18:33, David Boxenhorn da...@lookin2.com wrote: What do you mean by indexing?

On Thu, Dec 9, 2010 at 7:30 PM, Sébastien Druon sdr...@spotuse.com wrote: Thanks a lot for the answer. What about the indexing when adding a new element? Is it incremental? Thanks again

On 9 December 2010 14:38, David Boxenhorn da...@lookin2.com wrote: How about a regular CF where keys are n...@n ? Then, getting a matrix row would be the same cost as getting a matrix column (N gets), and it would be very easy to add element N+1.

On Thu, Dec 9, 2010 at 1:48 PM, Sébastien Druon sdr...@spotuse.com wrote: Hello, For a specific case, we are thinking about representing a N to N relationship with a NxN Matrix in Cassandra. The relations will be only between a subset of elements, so the Matrix will mostly contain empty elements. We have a set of questions concerning this:
- what is the best way to represent this matrix? what would have the best performance in reading? in writing?
  . a super column family with n column families, with n columns each
  . a column family with n columns and n lines
In the second case, we would need to extract 2 kinds of information:
- all the relations for a line: this should be no specific problem;
- all the relations for a column: in that case we would need an index for the columns, right? and then get all the lines where the value of the column in question is not null... is it the correct way to do?
When using indexes, say we want to add another element N+1. What impact in terms of time would it have on the indexation job? Thanks a lot for the answers, Best regards, Sébastien Druon

Before secondary indexes the only option was to store the data twice. Yes, you have to maintain this yourself. The data model only provides fast searches on the key. An index is normally a separate entity with a different ordering; it is almost the same here.
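The "store it twice" layout discussed in this thread (one CF keyed by matrix row, one keyed by matrix column) can be sketched with in-memory dicts standing in for the two column families. This is a toy model, not a Cassandra client: the class and method names are hypothetical, and a real client would issue two writes that are not atomic with each other, which is the maintenance cost raised above.

```python
from collections import defaultdict


class RelationMatrix:
    """Sparse N x N relation stored twice: once by row, once by column."""

    def __init__(self):
        self.by_row = defaultdict(dict)  # stands in for the "Rows" CF
        self.by_col = defaultdict(dict)  # stands in for the "Columns" CF

    def put(self, row, col, value):
        # Every write updates both copies.
        self.by_row[row][col] = value
        self.by_col[col][row] = value

    def get_row(self, row):
        """All relations for one matrix row: a single-key read."""
        return dict(self.by_row.get(row, {}))

    def get_col(self, col):
        """All relations for one matrix column: also a single-key read."""
        return dict(self.by_col.get(col, {}))

    def get(self, row, col):
        """One intersection, read from the row copy."""
        return self.by_row.get(row, {}).get(col)


if __name__ == "__main__":
    m = RelationMatrix()
    m.put("a", "b", 1)
    m.put("c", "b", 2)
    print(m.get_row("a"), m.get_col("b"), m.get("c", "b"))
```

Empty cells take no space in either copy, and adding element N+1 only adds new keys and columns without touching existing ones.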
Re: unable to start cassandra-0.7r2
On Mon, Dec 13, 2010 at 5:45 PM, Eric Evans eev...@rackspace.com wrote:

On Mon, 2010-12-13 at 17:27 -0500, Liangzhao Zeng wrote: I can run 0.6.6 using the same logging setup without any problem. Not sure what the difference is when starting up 0.7 in Eclipse. Can someone share the logging setup?

Make sure that you have -Dlog4j.configuration=log4j-server.properties among your VM arguments and that conf/ (assuming that's where you have it) has been added to the classpath. Since you say this worked with 0.6.6 and doesn't with 0.7, I'm guessing the latter is already in place and the former is the problem.

-- Eric Evans eev...@rackspace.com

I am not sure about the logging, but cassandra.config should now be a URI to your cassandra.yaml, not your storage-dir. Mine looks like this:

-Dcom.sun.management.jmxremote.port=
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcassandra-foreground
-Dcassandra.config=file:///home/edward/idea/conf/cassandra.yaml
-ea
-Xmx1G
Re: Read Latency Degradation
On Fri, Dec 17, 2010 at 8:21 AM, Wayne wav...@gmail.com wrote: We have been testing Cassandra for 6+ months and now have 10TB in 10 nodes with rf=3. It is 100% real data generated by real code in an almost production level mode. We have gotten past all our stability issues, java/cmf issues, etc. etc. now to find the one thing we assumed may not be true. Our current production environment is mysql with extensive partitioning. We have mysql tables with 3-4 billion records and our query performance is the same as with 1 million records ( 100ms). For those of us really trying to manage large volumes of data memory is not an option in any stretch of the imagination. Our current data volume once placed within Cassandra ignoring growth should be around 50 TB. We run manual compaction once a week (absolutely required to keep ss table counts down) and it is taking a very long amount of time. Now that our nodes are past 1TB I am worried it will take more than a day. I was hoping everyone would respond to my posting with something must be wrong, but instead I am hearing you are off the charts good luck and be patient. Scary to say the least given our current investment in Cassandra. Is it true/expected that read latency will get worse in a linear fashion as the ss table size grows? Can anyone talk me off the fence here? We have 9 MySQL servers that now serve up 15+TB of data. Based on what we have seen we need 100 Cassandra nodes with rf=3 to give us good read latency (by keeping the node data sizes down). The cost/value equation just does not add up. Thanks in advance for any advice/experience you can provide. On Fri, Dec 17, 2010 at 5:07 AM, Daniel Doubleday daniel.double...@gmx.net wrote: On Dec 16, 2010, at 11:35 PM, Wayne wrote: I have read that read latency goes up with the total data size, but to what degree should we expect a degradation in performance? What is the normal read latency range if there is such a thing for a small slice of scol/cols? 
Can we really put 2TB of data on a node and get good read latency querying data off of a handful of CFs? Any experience or explanations would be greatly appreciated.

If you really mean 2TB per node I strongly advise you to perform thorough testing with real world column sizes and the read/write load you expect. Try to load test at least with a test cluster / data that represents one replication group, i.e. RF=3 - 3 nodes. And test with the consistency level you want to use. Also test ring operations (repair, adding nodes, moving nodes) while under expected load. Combined with 'a handful of CFs' I would assume that you are expecting a considerable write load. You will get massive compaction load and with that data size the file system cache will suffer big time. You'll need loads of RAM and still ... I can only speak about 0.6 but ring management operations will become a nightmare and you will have very long running repairs. The cluster behavior changes massively with different access patterns (cold vs warm data) and data sizes. So you have to understand yours and test it. I think most generic load tests are mainly marketing instruments and I believe this is especially true for cassandra. Don't want to sound negative (I am a believer and don't regret our investment) but cassandra is no silver bullet. You really need to know what you are doing. Cheers, Daniel

Yes, major compactions for large sets of data do take a long time (360GB takes me about 6 hours). You said you need to compact to keep the sstable count low. This is not a good sign. My sstable counts sawtooth between 8-15 per CF through the day. If you are in a scenario where the SSTables are growing all day and only catch up at night, and you have tuned memtables, then you likely need more nodes. This means that your cluster can not really keep up with your write traffic. You know Cassandra can take bursts of writes well, but if you are at the point where your sstable count keeps getting higher you are essentially falling behind. (You may not need 100 nodes like you are suggesting, but possibly a few to get you over the fence.)

I do run major compactions at night, but not every night on every node. I do one node a night to make sure these are spread out over the week. With deletes on non-major compactions you may not need to do this, but we add and remove a lot of data per day so I find I have to/should, since the nights are quiet for us anyway.

As for how many nodes you need... What works out better?

Big Iron: 1x (2 TB, 64 GB RAM) cost? power? rack size?
Small factor: 4x (500GB, 16GB RAM) cost? power? rack size?

Generally I think most are running the small factor type deployment, and generally this works better by avoiding 2GB compactions!

Is it true that read latency grows linearly with sstable size? No (but it could be true in your case). As for your specific problems, more info is needed. How many nodes? How much ram
Re: Cassandra Monitoring
On Fri, Dec 17, 2010 at 5:48 AM, Daniel Doubleday daniel.double...@gmx.net wrote: Hi all, just wanted to share a simple way we use to monitor cassandra internals with zabbix. We use a minimal http server which reads jmx values and returns them in a property form. That's read by zabbix every 30 secs. It is started together with cassandra: https://gist.github.com/744761

Output looks something like:

d...@caladan[~]$ curl http://b22:9090/jmxexport
OperationMode=Normal
Load=151.379
ReadOperations=506334
WriteOperations=865867
TotalReadLatencyMicros=6663882635
TotalWriteLatencyMicros=352292885
BytesCompacted=0
BytesTotalInProgress=0
PendingTasks=0
HeapUsed=1153810280

How / what are you monitoring? Best practices someone? Cheers, Daniel Doubleday, smeet.com, Berlin

Using cacti and http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp. Many people are using munin; there is good support there.

Best practices:
- Monitor SSTable sizes and growth.
- Monitor reads/writes per second.
- Monitor cache hit rate.
- Monitor compactions (what % of the day an average node is compacting).
- Monitor SSTable count (make sure you do not have too many).
- Monitor IO wait (make sure you are not disk bound).
- Monitor JVM memory (make sure you have some headroom for bursts of traffic).
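As a rough illustration of consuming the property-style output shown in this message, the sketch below parses key=value lines and derives a mean latency from the operation count and the total latency. The function names are made up, and the sample text is copied from the post rather than fetched from a live endpoint.

```python
def parse_jmx_export(text):
    """Parse 'Key=Value' lines into a dict of strings; other lines are skipped."""
    metrics = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            metrics[key] = value
    return metrics


def mean_latency_micros(metrics, kind):
    """Mean latency for kind 'Read' or 'Write', in microseconds."""
    ops = int(metrics[f"{kind}Operations"])
    total = int(metrics[f"Total{kind}LatencyMicros"])
    return total / ops if ops else 0.0


SAMPLE = """\
OperationMode=Normal
ReadOperations=506334
WriteOperations=865867
TotalReadLatencyMicros=6663882635
TotalWriteLatencyMicros=352292885
"""

if __name__ == "__main__":
    parsed = parse_jmx_export(SAMPLE)
    print(mean_latency_micros(parsed, "Read"))
    print(mean_latency_micros(parsed, "Write"))
```

These counters look cumulative, so a monitoring system would probably want the difference between two polls rather than the lifetime average computed here.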
Re: Read Latency Degradation
On Fri, Dec 17, 2010 at 12:26 PM, Daniel Doubleday daniel.double...@gmx.net wrote:

How much ram is dedicated to cassandra? 12gb heap (probably too high?)
What is the hit rate of caches? high, 90%+

If your heap allows it I would definitely try to give more RAM to the fs cache. You're not using the row cache, so I don't see what cassandra would gain from so much memory.

A question about your tests: I assume that they run isolated (you load test one CF at a time) and the results are the same byte-wise? So the only difference is that one time you are reading from a larger file? Do you see the same IO load in both tests? Do you use mem-mapped io? And if so, are the number of page faults the same in both tests? In the end it could just be more physical movement of the disc heads with larger files ...

On Dec 17, 2010, at 5:46 PM, Wayne wrote: Below are some answers to your questions. We have wide rows (what we like about Cassandra) and I wonder if that plays into this? We have been loading 1 keyspace in our cluster heavily in the last week so it is behind in compaction for that keyspace. I am not even looking at those read latency times as there are as many as 100+ sstables. Compaction will run tomorrow for all nodes (weekend is our slow time) and I will test the read latency there. For the keyspaces/CFs that are already well compacted we are seeing a steady increase in read latency as the total sstable size grows, and a linear relationship between our different keyspace/CF sizes and the read latency for reads.

How many nodes? 10, 16 cores each (2 x quad HT CPUs)
How much ram per node? 24gb
What disks and how many? SATA 7200rpm, 1x1tb for commit log, 4x1tb (raid0) for data
Is your ring balanced? yes, random partitioned very evenly
How many column families? 4 CFs x 3 Keyspaces
How much ram is dedicated to cassandra? 12gb heap (probably too high?)
What type of caching are you using? Key caching
What are the sizes of caches? 500k-1m values for 2 of the CFs
What is the hit rate of caches? high, 90%+
What does your disk utilization|CPU|Memory look like at peak times? Disk goes to 90%+ under heavy read load. CPU load high as well. Latency does not change that much for single reads vs. under load (30 threads). We can keep current read latency up to 25-30 read threads if no writes or compaction is going on. We are worried about what we see in terms of latency for a single read.
What are your average mean|max row size from cfstats? 30k avg/5meg max for one CF and 311k avg/855k max for the other.
On average for a given sstable how large are the data, bloom and index files? 30gig data, 189k filter, 5.7meg index for one CF; 98gig data, 587k filter, 18meg index for the other.

Thanks.
Re: Which Java on Fedora? Sun's or GNU's?
On Wed, Dec 29, 2010 at 11:29 AM, Eric Evans eev...@rackspace.com wrote:

On Wed, 2010-12-29 at 10:56 -0500, Edward Capriolo wrote: Cassandra pushes your JVM hard. Do not count on your distro, which might provide versions of things that are 3 months to 2 years old.

Come on. If it worked fine 3 months ago, then chances are it will continue to. This is one of the reasons that people choose (environmentally) stable distro releases (which are often supported for much longer than 2 years).

Choosing what your distro gives you, prepare to be disappointed and to have to upgrade as soon as you get some respectable load. If you are using sun/oracle (it still feels strange to say JVM oracle) you want something much higher than just 1.6.0. Go for the latest and greatest, 1.6.21 or higher; JRE/JDK 1.6.23.

FWIW, the wiki says: For Sun's jvm, this means at least u19; u21 is better.

I install the JDK (not the JRE) because it's a superset and hey, I just might feel like compiling something. Other not so great options... rpm -Uvh --force --skip-deps (If you know you have a Java that your RPM manager does not know about)

No. If this is really the situation, then it's disingenuous to offer the package at all, and it should be dropped. I don't think these command line arguments should ever appear on a public mailing list.

Get the source RPM and strip out the Java dependency (If you know you have a Java that your RPM manager does not know about). Create a source RPM with nothing in it that PROVIDES JAVA (If you know you have a Java that your RPM manager does not know about)

-- Eric Evans eev...@rackspace.com

If it worked fine three months ago and you came into Cassandra IRC with a random JVM problem, the first thing someone would tell you to do is probably update to the latest JVM :)

Some distros go for perceived stability over bug/performance enhancements in their package choices. For example, (a major unnamed linux distribution) still ships mysql 5.0 rather than 5.1, or a BerkeleyDB that NEVER gets upgraded. Why? Tracking these packages and all the downstream changes from code that links to mysql or BDB would result in way too much churn, which would make them look less stable and enterprise-like. Another major distribution allows anyone to submit a package; as a result they end up with hundreds/thousands of packages that NEVER get updated or supported in any meaningful way.

As for Cassandra, there are two key components: Java and Cassandra. If you are just taking whatever the distro gives you for these things, you should probably do more research.

As to not letting the cat out of the bag on what you can do with RPM: I agree, half-heartedly. RPM is a glorified tar, and when it begins insisting you need 40 dependent libraries you do not really need (which is very common, especially in the RPM Java world) because some applet buried in an example somewhere just might need X11... Well, I am more likely to edit the source RPM and make myself happy than let RPM install all of GNOME just so the RPM is happy. In this case OpenJDK or SUN should meet the java >= 1.6.0 requirement.

Edward
Re: The size of the data, I must be doing smth wrong....
On Wed, Jan 5, 2011 at 9:52 AM, Jonathan Ellis jbel...@gmail.com wrote: It's normal for Cassandra to use more disk space than MySQL. It's part of what we trade for not having to rewrite every row when you add a new column. SSTables that are obsoleted by a compaction are deleted asynchronously when the JVM performs a GC. http://wiki.apache.org/cassandra/MemtableSSTable

On Wed, Jan 5, 2011 at 8:35 AM, nicolas lattuada nicolaslattu...@hotmail.fr wrote: Hi, I have some data size issues. I am storing super columns with the following content: {a=1, b=2, c=3...n=14}. I am storing it 300 000 times and I have a data size on disk of about 283 MB. On the other side I have a mysql table which stores a bunch of data; the schema follows: 6 varchars +100 5 ints +6. I put about 1 300 000 records in it and end up with 150 MB of data and 57 MB of index. So I think I am certainly doing something wrong... The other thing is that when I run flush and then compact, the size of my data increases, so I imagine something is copied on compaction. So is there a way to remove the unused data? (cleanup doesn't seem to do the job). Any help to reduce the size of the data would be greatly appreciated! Greetings

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com

Unlike datastores that are delimited or have fixed column sizes, Cassandra has neither. Each row is a sorted map of columns. A column is a tuple of {columnname, columnvalue, time}, so the name and timestamp are stored with every value. Also, the data is not stored as tersely as it is inside mysql.
Re: Is this a good schema design to implement a social application..
On Fri, Jan 7, 2011 at 11:38 PM, Rajkumar Gupta rajkumar@gmail.com wrote: In the twissandra example, http://www.riptano.com/docs/0.6/data_model/twissandra#adding-friends , I find that they have split the materialized view of a user's homepage (like his followers list, tweets from friends) into several column families instead of putting it in supercolumns inside a single SuperColumnFamily, thereby making the rows skinnier. I was wondering which one will give better performance in terms of reads. I think skinnier rows will definitely have the advantage of fewer row mutations and thus good read performance when only they need to be retrieved, plus supercolumns for the follower list etc. are avoided (this sounds good as the supercolumn indexing limitations will not hurt). But I am still not sure whether it would be beneficial in terms of performance numbers if I split the materialized view of a single user into several column families instead of a single row in a single SuperColumnFamily.

On Sat, Jan 8, 2011 at 2:05 AM, Rajkumar Gupta rajkumar@gmail.com wrote: The fact that subcolumns inside supercolumns aren't indexed currently may hurt here, whenever a small number (10-20) of subcolumns need to be retrieved from a large list of subcolumns of a supercolumn like MyPostsIdKeysList.

On Fri, Jan 7, 2011 at 9:58 PM, Raj rajkumar@gmail.com wrote: My question is in the context of a social network schema design. I am thinking of the following schema for storing the user data that is required when he logs in and is led to his homepage. (I aimed at a schema design such that a single row read query retrieves all the data required to put up the homepage of that user.)
UserSuperColumnFamily: { // Column Family
    UserIDKey: {
        columns: MyName, MyEmail, MyCity, ...etc
        supercolumns: MyFollowersList, MyFollowiesList, MyPostsIdKeysList, MyInterestsList, MyAlbumsIdKeysList, MyVideoIdKeysList, RecentNotificationsForUserList, MessagesReceivedList, MessagesSentList, AccountSettingsList, RecentSelfActivityList, UpdatesFromFollowiesList
    }
}

Thus the user's newsfeed would be generated using the supercolumn UpdatesFromFollowiesList. But UpdatesFromFollowiesList would obviously contain only the Ids of the posts and not the entire post data.

Questions:
1.) What could be the problems with this design, any improvements?
2.) Would frequent heavy overwrite operations / row mutations (for example, when propagating the post updates for the news feed from some user to all his followies), which lead to rows ultimately being in several SSTables, lead to degraded read performance? Is it suitable to use the row cache (a very big row, but all the data required while the user is logged in)? If I do not use the cache, it may be very expensive to pull the row each time data is required for the given user, since the row would be in several sstables. How can I improve the read performance here?

The actual data of the posts from the network would be retrieved using PostIdKey through subsequent read queries from the column family PostsSuperColumnFamily, which would be as follows:

PostsSuperColumnFamily: {
    PostIdKey: {
        columns: PostOwnerId, PostBody
        supercolumns: TagsForPost {list of columns of all tags for the post}, PeopleWhoLikedThisPost {list of columns of UserIdKey of all the likers}
    }
}

Is this the best design to go with or are there any issues to consider here? Thanks in anticipation of your valuable comments!

From your description, UserSuperColumnFamily seems to be both a Standard Column Family and a Super Column Family. You can not do that. However, you can encode things such as MyName, MyCity and MyState into a 'UserInfo' super column: UserInfo:MyState...

(As you mentioned) Super Columns are not indexed and have to be completely de-serialized for each access. Because of this they are not widely used for anything but small keys with a few columns. This also applies to mutations: the row can exist in multiple SSTables until it finally gets compacted. That can result in much more storage used for an object that changes often. Most designs use composite keys, or something like JSON encoded values with Standard Column Families, to achieve something like a Super Column. (SuperColumns are not always as Super as they seem :)
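The "composite keys or JSON encoded values with Standard Column Families" idea can be sketched with a plain dict standing in for one standard column family. Everything here is an illustrative assumption: the "user:kind" key format, the "profile" column name and the helper functions are invented for the example and are not taken from any client library.

```python
import json


def composite_key(user_id, kind):
    """Row key made of the user id plus the kind of data, e.g. 'u42:followers'."""
    return f"{user_id}:{kind}"


def save_profile(cf, user_id, profile):
    """Store the small, rarely split profile fields as one JSON-encoded column."""
    row = cf.setdefault(composite_key(user_id, "info"), {})
    row["profile"] = json.dumps(profile, sort_keys=True)


def load_profile(cf, user_id):
    return json.loads(cf[composite_key(user_id, "info")]["profile"])


def add_follower(cf, user_id, follower_id, since):
    """One column per follower in a separate, skinnier row."""
    cf.setdefault(composite_key(user_id, "followers"), {})[follower_id] = since


def followers(cf, user_id):
    return sorted(cf.get(composite_key(user_id, "followers"), {}))


if __name__ == "__main__":
    users_cf = {}  # row key -> {column name -> value}
    save_profile(users_cf, "u42", {"MyName": "Raj", "MyCity": "Delhi"})
    add_follower(users_cf, "u42", "u7", "2011-01-07")
    print(load_profile(users_cf, "u42"), followers(users_cf, "u42"))
```

Each list that would have been a supercolumn becomes its own row, so reading ten followers does not require de-serializing the posts list or the profile.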
Re: Welcome committer Jake Luciani
Three cheers! On Thu, Jan 13, 2011 at 1:45 PM, Jake Luciani jak...@gmail.com wrote: Thanks Jonathan and Cassandra PMC! Happy to help Cassandra take over the world! -Jake On Thu, Jan 13, 2011 at 1:41 PM, Jonathan Ellis jbel...@gmail.com wrote: The Cassandra PMC has voted to add Jake as a committer. (Jake is also a committer on Thrift.) Welcome, Jake, and thanks for the hard work! -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com
Re: cassandra row cache
Is it possible that you are reading at READ.ONE and that READ.ONE only warms cache on 1 of your three nodes= 20. 2nd read warms another 60%, and by the third read all the replicas are warm? 99% ? This would be true if digest reads were not warming caches. Edward

On Thu, Jan 13, 2011 at 4:07 PM, Saket Joshi sjo...@touchcommerce.com wrote: The cache is 800,000 per node, and I have 15 nodes in the cluster. I see the cache value increased after the first run; the row cache hit rate was 0 for the first run. For the second run of the same data, the hit rate increased to 30%, but on the third it jumps to 99%. -Saket

-Original Message- From: Chris Burroughs [mailto:chris.burrou...@gmail.com] Sent: Thursday, January 13, 2011 1:03 PM To: user@cassandra.apache.org Cc: Saket Joshi Subject: Re: cassandra row cache

On 01/13/2011 02:05 PM, Saket Joshi wrote: Yes it does change. So the confusing part for me is why a cache of size 80,000 would not be filled after 1,600,000 requests.

Can you observe items cached and hit rate while making the first 1.6 million row query?
Re: about the data directory
On Thu, Jan 13, 2011 at 7:56 PM, raoyixuan (Shandy) raoyix...@huawei.com wrote: I am a bit confused: why can users read the data from all nodes? I mean, the data is only kept on the replicas, so how is that achieved?

-Original Message- From: sc...@scode.org [mailto:sc...@scode.org] On Behalf Of Peter Schuller Sent: Friday, January 14, 2011 1:19 AM To: user@cassandra.apache.org Subject: Re: about the data directory

So you mean just the replica nodes' sstables will be changed, right?

The data will only be written to the nodes that are part of the replica set for the row (with the exception of hinted handoff, but that's a different sstable).

If all the replica nodes broke down, can the users still read the data?

If *all* nodes in the replica set for a particular row are down, then you won't be able to read that row, no.

-- / Peter Schuller

It does not matter which node you connect to. The node you connect to computes the hash of the key (or uses the key itself when using the Order Preserving Partitioner) to determine which node or nodes the data should be on. If the key is on that node, it returns it directly to the client. If the key is not on that node, Cassandra fetches it from another node and then returns that data. The client is unaware of, and does not need to be concerned with, where the data came from.
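A simplified sketch of the lookup described above: hash the key to a token, walk the ring to the first node whose token is at or after it, and take the following nodes as the other replicas. It assumes an MD5-based token in a 0 to 2**127 space and the simplest possible replica placement; it ignores racks, consistency levels and hinted handoff, and the function names are hypothetical.

```python
import bisect
import hashlib

RING = 2**127


def key_token(key: bytes) -> int:
    """Map a row key onto the ring (simplified stand-in for RandomPartitioner)."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % RING


def replicas_for(token, ring, rf):
    """Nodes holding `token`; `ring` is a list of (token, node) sorted by token."""
    tokens = [t for t, _ in ring]
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect.bisect_left(tokens, token) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(min(rf, len(ring)))]


def coordinate_read(key, coordinator, ring, rf):
    """Return (replica to read from, True if the coordinator served it locally)."""
    owners = replicas_for(key_token(key), ring, rf)
    if coordinator in owners:
        return coordinator, True
    return owners[0], False


if __name__ == "__main__":
    ring = [(i * RING // 4, f"node{i}") for i in range(4)]
    print(coordinate_read(b"some-row-key", "node0", ring, rf=2))
```

Whichever node the client connects to, the same replica set is computed, which is why the client does not need to know where the data lives.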
Re: live data migration from mysql to cassandra
On Fri, Jan 14, 2011 at 10:40 AM, ruslan usifov ruslan.usi...@gmail.com wrote: Hello, dear community, please share your experience: how do you do a live (without stopping) migration from MySQL or another RDBMS to Cassandra? There is no built-in way to do this. I remember hearing at Hadoop World this year that the HBase guys have a system to read MySQL slave logs and replay them into HBase. Since all the NoSQL community seems to do this, maybe we can 'borrow' this idea. Edward
Re: Cassandra in less than 1G of memory?
On Fri, Jan 14, 2011 at 2:13 PM, Victor Kabdebon victor.kabde...@gmail.com wrote: Dear rajat, Yes it is possible, I have the same constraints. However I must warn you, from what I see Cassandra memory consumption is not bounded in 0.6.X on Debian 64 bit. Here is an example of an instance launched on a node:
root 19093 0.1 28.3 1210696 570052 ? Sl Jan11 9:08 /usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon
Look at the second highlighted value (-Xmx512M): Xmx indicates the maximum memory that Cassandra can use; it is set to 512 MB, so it could easily fit into 1 GB. Now look at the first one (the resident size, 570052 KB): 570 MB > 512 MB. Moreover, if I come back in one day the first value will be even higher.
Probably around 610 MB. Actually it increases to the point where I need to restart it, otherwise other programs are shut down by Linux so that Cassandra can further expand its memory usage... By the way, this is a call to other Cassandra users: am I the only one to encounter this problem? Best regards, Victor K. 2011/1/14 Rajat Chopra rcho...@makara.com Hello. According to the JVM heap size topic at http://wiki.apache.org/cassandra/MemtableThresholds , Cassandra would need at least 1G of memory to run. Is it possible to have a running Cassandra cluster with machines that have less than that memory… say 512M? I can live with slow transactions, no compactions etc, but do not want an OutOfMemory error. The reason for a smaller bound for Cassandra is that I want to leave room for other processes to run. Please help with specific parameters to tune. Thanks, Rajat -Xmx512M is not an overall memory limit. It only caps the Java heap. MMAP'ed files also consume memory. Try setting the disk access mode to standard (not MMAP or MMAP_INDEX_ONLY).
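To make the comparison in this thread concrete, here is a small sketch that pulls the -Xmx value out of a Java command line and compares it with the resident size reported by ps. The numbers come from the ps line quoted above (the command line is shortened); the helper name `heap_limit_mib` is made up for this illustration, and ps is assumed to report RSS in KiB.

```python
import re

def heap_limit_mib(java_cmdline: str) -> int:
    """Return the -Xmx value in MiB (supports K, M and G suffixes)."""
    match = re.search(r"-Xmx(\d+)([kKmMgG])", java_cmdline)
    if match is None:
        raise ValueError("no -Xmx flag found")
    value, unit = int(match.group(1)), match.group(2).upper()
    return {"K": value // 1024, "M": value, "G": value * 1024}[unit]

# Values taken from the ps line quoted in the thread.
rss_kib = 570052
cmdline = "/usr/bin/java -ea -Xms128M -Xmx512M -XX:+UseParNewGC"

rss_mib = rss_kib / 1024
overshoot = rss_mib - heap_limit_mib(cmdline)
print(f"RSS {rss_mib:.0f} MiB exceeds the heap limit by {overshoot:.0f} MiB")
```

The difference is memory outside the Java heap (mmap'ed files, thread stacks, the JVM itself), which is why the resident size can grow past -Xmx.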
Re: balancing load
On Sun, Jan 16, 2011 at 11:45 AM, Karl Hiramoto k...@hiramoto.org wrote: Hi, I have a keyspace with Replication Factor: 2 and it seems that most of my data goes to one node. What am I missing to have Cassandra balance more evenly?
./nodetool -h host1 ring
Address     Status  State   Load      Owns    Token
                                              82740373310283352874863875878673027619
10.1.4.14   Up      Normal  17.45 GB  77.48%  44427918469925720421829352515848570517
10.1.4.12   Up      Normal  8.1 GB    8.12%   58247356085106932369828800153350419939
10.1.4.13   Up      Normal  49.51 KB  1.66%   61078635599166706937511052402724559481
10.1.4.15   Up      Normal  54.48 KB  6.37%   71909504454725029906187464140698793550
10.1.4.10   Up      Normal  44.38 KB  6.37%   82740373310283352874863875878673027619
I use phpcassa as a client and it should randomly choose a host to connect to. -- Karl
For a 5 node cluster your initial tokens should be:
tokens=5 ant -DclassToRun=hpcas.c01.InitialTokens run
run:
[java] 0
[java] 34028236692093846346337460743176821145
[java] 68056473384187692692674921486353642290
[java] 102084710076281539039012382229530463435
[java] 136112946768375385385349842972707284580
To see how these numbers were calculated: http://wiki.apache.org/cassandra/Operations#Token_selection Use nodetool move and nodetool cleanup to correct the imbalance of your cluster.
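The evenly spaced tokens listed above can be reproduced with a few lines of arithmetic: divide the 2**127 token range of the random partitioner by the node count and take multiples of that step. This is a minimal sketch of the calculation, not the tool used in the reply; the function name `initial_tokens` is made up here.

```python
def initial_tokens(node_count: int) -> list:
    """Evenly spaced tokens over the 0..2**127 range, one per node."""
    step = 2**127 // node_count
    return [i * step for i in range(node_count)]

for token in initial_tokens(5):
    print(token)
```

Each node is then given one of these values with nodetool move, followed by nodetool cleanup, as suggested in the reply.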
Re: balancing load
On Mon, Jan 17, 2011 at 2:44 AM, aaron morton aa...@thelastpickle.com wrote: The nodes will not automatically delete stale data; to do that you need to run nodetool cleanup. See step 3 in the Range Changes Bootstrap http://wiki.apache.org/cassandra/Operations#Range_changes If you are feeling paranoid beforehand, you could run nodetool repair on each node in turn to make sure they have the correct data. http://wiki.apache.org/cassandra/Operations#Repairing_missing_or_inconsistent_data You may also have some tombstones in there; they will not be deleted until after GCGraceSeconds http://wiki.apache.org/cassandra/DistributedDeletes Hope that helps. Aaron On 17 Jan 2011, at 20:34, Karl Hiramoto wrote: Thanks for the help. I used nodetool move, so now each node owns 20% of the space, but it seems that the data load is still mostly on 2 nodes.
nodetool --host slave4 ring
Address     Status  State   Load      Owns    Token
                                              136112946768375385385349842972707284580
10.1.4.10   Up      Normal  335.9 MB  20.00%  0
10.1.4.12   Up      Normal  54.42 KB  20.00%  34028236692093846346337460743176821145
10.1.4.13   Up      Normal  59.32 KB  20.00%  68056473384187692692674921486353642290
10.1.4.14   Up      Normal  6.33 GB   20.00%  102084710076281539039012382229530463435
10.1.4.15   Up      Normal  6.36 GB   20.00%  136112946768375385385349842972707284580
-- Karl
Just to head off the next possible problem. If you run 'nodetool cleanup' on each node and some of your nodes still have more data than others, then it probably means you are writing the majority of data to a few keys. (You probably do not want to do that.) If that happens, you can use nodetool cfstats on each node and ensure that the 'max row compacted size' is roughly the same on all nodes. If you have one or two really big rows, that could explain your imbalance.
Re: balancing load
On Mon, Jan 17, 2011 at 10:51 AM, Peter Schuller peter.schul...@infidyne.com wrote: Just to head off the next possible problem. If you run 'nodetool cleanup' on each node and some of your nodes still have more data than others, then it probably means you are writing the majority of data to a few keys. (You probably do not want to do that.) It may also be that a compact is needed if the discrepancies are within the variation expected during normal operation due to compaction (this assumes overwrites/deletions in write traffic). -- / Peter Schuller @Peter Isn't cleanup a special case of compaction? I.e., it works as a major compaction and also removes data not belonging to the node?
Re: balancing load
On Mon, Jan 17, 2011 at 1:20 PM, Karl Hiramoto k...@hiramoto.org wrote: On 01/17/11 15:54, Edward Capriolo wrote: Just to head off the next possible problem. If you run 'nodetool cleanup' on each node and some of your nodes still have more data than others, then it probably means you are writing the majority of data to a few keys. (You probably do not want to do that.) If that happens, you can use nodetool cfstats on each node and ensure that the 'max row compacted size' is roughly the same on all nodes. If you have one or two really big rows, that could explain your imbalance. Well, I did a lengthy repair/cleanup on each node, but still have data mainly on two nodes (I have RF=2).
./apache-cassandra-0.7.0/bin/nodetool --host host3 ring
Address     Status  State   Load       Owns    Token
                                               119098828422328462212181112601118874007
10.1.4.10   Up      Normal  347.11 MB  30.00%  0
10.1.4.12   Up      Normal  49.41 KB   20.00%  34028236692093846346337460743176821145
10.1.4.13   Up      Normal  54.35 KB   20.00%  68056473384187692692674921486353642290
10.1.4.15   Up      Normal  19.09 GB   16.21%  95643579558861158157614918209686336369
10.1.4.14   Up      Normal  15.62 GB   13.79%  119098828422328462212181112601118874007
In cfstats I see:
Compacted row minimum size: 1131752
Compacted row maximum size: 8582860529
Compacted row mean size: 1402350749
On the lowest used node I see:
Compacted row minimum size: 0
Compacted row maximum size: 0
Compacted row mean size: 0
I basically have MyKeyspace.Offer[UID] = value; my value is at most 500 bytes. For the UID I just use 12 random alphanumeric characters [A-Z][0-9]. Should I try to adjust my tokens to fix the imbalance, or something else?
I'm using Redhat EL 5.5 java -version java version 1.6.0_17 OpenJDK Runtime Environment (IcedTea6 1.7.5) (rhel-1.16.b17.el5-x86_64) OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode) I have some errors in my logs: ERROR [ReadStage:1747] 2011-01-17 18:13:53,988 DebuggableThreadPoolExecutor.java (line 103) Error in ThreadPoolExecutor java.lang.AssertionError at org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:70) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077) at org.apache.cassandra.db.Table.getRow(Table.java:384) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:636) ERROR [ReadStage:1747] 2011-01-17 18:13:53,989 AbstractCassandraDaemon.java (line 91) Fatal exception in thread Thread[ReadStage:1747,5,main] java.lang.AssertionError at org.apache.cassandra.db.columniterator.SSTableNamesIterator.readIndexedColumns(SSTableNamesIterator.java:178) at 
org.apache.cassandra.db.columniterator.SSTableNamesIterator.read(SSTableNamesIterator.java:132) at org.apache.cassandra.db.columniterator.SSTableNamesIterator.init(SSTableNamesIterator.java:70) at org.apache.cassandra.db.filter.NamesQueryFilter.getSSTableColumnIterator(NamesQueryFilter.java:59) at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:80) at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1215) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1107) at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1077) at org.apache.cassandra.db.Table.getRow(Table.java:384) at org.apache.cassandra.db.SliceByNamesReadCommand.getRow(SliceByNamesReadCommand.java:60) at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:68) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:63) at java.util.concurrent.ThreadPoolExecutor.runWorker
Re: changing the replication level on the fly
On Tue, Jan 18, 2011 at 2:14 PM, Jeremy Stribling st...@nicira.com wrote: Hi, I've noticed in the new Cassandra 0.7.0 release that if I have a keyspace with a replication level of 2, but only one Cassandra node, I cannot insert anything into the system. Likely this was a bug in the old release I was using (0.6.8 -- is there a JIRA describing this problem?). However, this is a problem for our application, as we don't want to have to predefine the number of nodes, but rather start with one node, and add nodes as needed. Ideally, we could start our system with one node, and be able to insert data just on that one node. Then, when a second node is added, we can start using that node to store replicas for the keyspace. I know that 0.7.0 has a new operation for updating keyspace properties like replication level, but in the documentation there is some mention about having to run manual repair operations after using it. My question is: what happens if we do not run these repair operations? Here's what I'd like to do:
1) Start with a single node with autobootstrap=false and replication level=1.
2) Later, start a second node with autobootstrap=true and join it to the first.
3) The application detects that there are now two nodes, and issues the command to pump up the replication level to 2.
4) If it ever drops back down to one node, it will turn the replication level down again.
If we do not do a repair, will all hell break loose, or will it just be the case that data inserted when there was only one node will continue to be unreplicated, but data inserted when there were two nodes will have two replicas? Thanks, Jeremy
If you up your replication factor and do not repair, this is what happens:
READ.QUORUM - This is safe. Over time all entries that are read will be fixed through read repair. Reads will return correct data. BUT data never read will never be copied to the new node.
READ.ONE - 50% of your reads will return correct data. 50% of your reads will return NO data the first time (based on the server your read hits). Then they will be read repaired. The second read will return the correct data.
You can extrapolate the complications caused by this if you add 10 or 15 nodes over time. You are never really sure if the data from the first node got replicated to the second; did the second get replicated to the third? Brain hurting... CAP is complicated enough...
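The READ.ONE and READ.QUORUM behaviour described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions: two in-memory replicas, no timestamps or conflict resolution, and read repair that runs synchronously after the read. The names `Replica`, `read_one` and `read_quorum` are invented for this example and are not Cassandra APIs.

```python
class Replica:
    """A toy replica holding row key -> value."""
    def __init__(self):
        self.data = {}

def read_one(replicas, index, key):
    """READ.ONE: answer from a single replica, then repair the others."""
    answer = replicas[index].data.get(key)
    # Read repair (simplified): copy the value to any replica missing it.
    known = [r.data[key] for r in replicas if key in r.data]
    if known:
        for r in replicas:
            r.data.setdefault(key, known[0])
    return answer

def read_quorum(replicas, key):
    """READ.QUORUM with RF=2: both replicas are consulted before answering."""
    known = [r.data[key] for r in replicas if key in r.data]
    value = known[0] if known else None
    if value is not None:
        for r in replicas:
            r.data.setdefault(key, value)
    return value

old, new = Replica(), Replica()
old.data["k1"] = "v1"   # written while the replication factor was 1
old.data["k2"] = "v2"

first = read_one([old, new], 1, "k1")    # hits the new, still empty replica
second = read_one([old, new], 1, "k1")   # read repair has filled it in
via_quorum = read_quorum([old, new], "k2")
print(first, second, via_quorum)
```

A key that is never read stays on the old replica only, which is the gap a manual repair closes.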
Re: please help with multiget
On Tue, Jan 18, 2011 at 4:29 PM, Shu Zhang szh...@mediosystems.com wrote: Well, I don't think what I'm describing is complicated semantics. I think I've described general batch operation design and something that is symmetrical to the batch_mutate method already on the Cassandra API. You are right, I can solve the problem with further denormalization, and the approach of making individual gets in parallel as described by Brandon will work too. I'll be doing one of these for now. But I think neither is as efficient, and I guess I'm still not sure why the multiget is designed the way it is. The problem with denormalization is you gotta make multiple row writes in place of one, adding load to the server, adding required physical space and losing atomicity on write operations. I know writes are cheap in Cassandra, and you can catch failed writes and retry so these problems are not major, but it still seems clear that having a batch-get that works appropriately is at least a little better... From: Aaron Morton [aa...@thelastpickle.com] Sent: Tuesday, January 18, 2011 12:55 PM To: user@cassandra.apache.org Subject: Re: please help with multiget I think the general approach is to denormalise data to remove the need for complicated semantics when reading. Aaron On 19/01/2011, at 7:57 AM, Shu Zhang szh...@mediosystems.com wrote: Well, maybe making a batch-get is not any more efficient on the server side, but without it, you can get bottlenecked on client-server connections and client resources. If the number of requests you want to batch is on the order of connections in your pool, then yes, making gets in parallel is as good or maybe better. But what if you want to batch thousands of requests? The server I can scale out; I would want to get my requests there without needing to wait for connections on my client to free up. I just don't really understand the reasoning for designing multiget_slice the way it is.
I still think if you're gonna have a batch-get request (multiget_slice), you should be able to add to the batch a reasonable number of ANY corresponding non-batch get requests. And you can't do that... Plus, it's not symmetrical to the batch-mutate. Is there a good reason for that? From: Brandon Williams [dri...@gmail.com] Sent: Monday, January 17, 2011 5:09 PM To: user@cassandra.apache.org Cc: hector-us...@googlegroups.com Subject: Re: please help with multiget On Mon, Jan 17, 2011 at 6:53 PM, Shu Zhang szh...@mediosystems.com wrote: Here's the method declaration for quick reference: map<string,list<ColumnOrSuperColumn>> multiget_slice(string keyspace, list<string> keys, ColumnParent column_parent, SlicePredicate predicate, ConsistencyLevel consistency_level) It looks like you must have the same SlicePredicate for every key in your batch retrieval, so what are you supposed to do when you need to retrieve different columns for different keys? Issue multiple gets in parallel yourself. Keep in mind that multiget is not an optimization; in fact, it can work against you when one key exceeds the rpc timeout, because you get nothing back. -Brandon multiget_slice is very useful IMHO. In my testing, the roundtrip time for 1000 get requests all being acked individually is much higher than the roundtrip time for 200 requests with get_slice calls grouped 5 at a time. Anyone who needs that type of access is in good shape. I was also theorizing that a CF using RowCache with a very, very high read rate would benefit from pooling a bunch of reads together with multiget. I do agree that the first time I looked at the multiget_slice signature I realized I could not do many of the things I was expecting from a multi-get.
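Brandon's suggestion to "issue multiple gets in parallel yourself" could look roughly like the sketch below. It is only an illustration: `get_columns` is a hypothetical stand-in for a single-key read with its own column list, backed here by an in-memory dictionary rather than a real client, and the pool size is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the cluster: row key -> {column name: value}.
FAKE_STORE = {
    "user1": {"name": "Ann", "email": "ann@example.com", "age": "31"},
    "user2": {"name": "Bob", "email": "bob@example.com", "age": "45"},
}

def get_columns(key, column_names):
    """Hypothetical single-key read; a real client call would go here."""
    row = FAKE_STORE.get(key, {})
    return {c: row[c] for c in column_names if c in row}

def parallel_get(requests, max_workers=4):
    """requests: {row key: [column names]}, a different predicate per key."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(get_columns, key, cols)
                   for key, cols in requests.items()}
        return {key: future.result() for key, future in futures.items()}

result = parallel_get({"user1": ["name"], "user2": ["email", "age"]})
print(result)
```

Unlike a single multiget, a slow key here only delays its own result, at the cost of one client connection per in-flight request, which is the trade-off Shu raises.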
Re: Cassandra on iSCSI?
On Fri, Jan 21, 2011 at 12:07 PM, Jonathan Ellis jbel...@gmail.com wrote: On Fri, Jan 21, 2011 at 2:19 AM, Mick Semb Wever m...@apache.org wrote: Of course with a SAN you'd want RF=1 since it's replicating internally. Isn't this the same case for raid-5 as well? No, because the replication is (mainly) to protect you from machine failures; if the SAN is a SPOF then putting more replicas on it doesn't help. And we want RF=2 if we need to keep reading while doing rolling restarts? Yes. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com If you are using Cassandra with a SAN, RF=1 makes sense because we are making the assumption that the SAN is already replicating your data. RF=2 makes good sense if you do not want to be affected by outages. Another alternative is something like Linux-HA, managing each Cassandra instance as a resource. This way, if a head goes down, Linux-HA would detect the failure and bring up that instance on another physical piece of hardware. Using Linux-HA + SAN + Cassandra would actually bring Cassandra closer to the HBase model, in which you have a distributed file system but the front-end Cassandra acts like a region server.
Re: Lost MUTATIONS on several Cassandra nodes - no impact on the client
On Sun, Jan 23, 2011 at 6:30 AM, ruslan usifov ruslan.usi...@gmail.com wrote: 2011/1/20 Jonathan Ellis jbel...@gmail.com It guarantees that if the requested ConsistencyLevel is not achieved, the client will get a TimedOutException, which is a signal you need to add capacity to handle what you are throwing at the cluster. Sorry, and when is UnavailableException thrown? When data can't be saved anywhere? Right. The difference is that the gossip process builds a topology of UP/DOWN hosts, so Unavailable is thrown quickly. If you need ALL and one replica is known to be down - Unavailable. However, if the coordinator believed the node was UP and the request took longer than RpcTimeout (default 10,000 ms) - TimedOutException
Re: Lost MUTATIONS on several Cassandra nodes - no impact on the client
On Sun, Jan 23, 2011 at 11:23 AM, ruslan usifov ruslan.usi...@gmail.com wrote: On Sun, Jan 23, 2011 at 6:30 AM, ruslan usifov ruslan.usi...@gmail.com wrote: Right. The difference is that the gossip process builds a topology of UP/DOWN hosts, so Unavailable is thrown quickly. If you need ALL and one replica is known to be down - Unavailable. Is it possible to detect that a write didn't happen anywhere? Or is it only possible to detect a consistency failure? Regardless of which exception is thrown, you should retry the write from your client. If the method threw UnavailableException, the write operation did not happen on any node, as the coordinator judged that it would not have succeeded. If the method threw TimedOutException, the write could have succeeded on some nodes, but it was not acknowledged on enough of them to meet the CL you requested. It would be nice if the exception could be populated with more information, such as TimedOutException: requested: 3 succeeded: 1 succeededList: 127.0.0.1 requestedList: 127.0.0.1,127.0.0.2,127.0.0.3 This would make the exceptions more self-explanatory, and would give more transparency to the clients. (Knowing my luck, thrift probably does not allow complex exception types.)
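The advice above, retry the write whichever of the two exceptions comes back, can be sketched as a small client-side wrapper. The exception classes below are local stand-ins named after the Thrift exceptions discussed in the thread, and `write_with_retry` and `flaky_write` are hypothetical helpers, not part of any Cassandra client library. A real client would likely also wait between attempts.

```python
class UnavailableException(Exception):
    """Stand-in: the coordinator knew too few replicas were up; nothing was written."""

class TimedOutException(Exception):
    """Stand-in: too few acks arrived in time; some replicas may hold the write."""

def write_with_retry(write, attempts=3):
    """Call write() until it succeeds; return the number of attempts used."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            write()
            return attempt
        except (UnavailableException, TimedOutException) as error:
            # Either way the client cannot assume the write met the
            # requested consistency level, so it tries again.
            last_error = error
    raise last_error

# Simulated write that fails twice before succeeding.
failures = [TimedOutException(), UnavailableException()]

def flaky_write():
    if failures:
        raise failures.pop(0)

attempts_used = write_with_retry(flaky_write)
print(attempts_used)
```

Retrying after a timeout is reasonable because re-sending the same column with the same timestamp leaves the stored data unchanged on replicas that already applied it.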
Re: Cassandra + Puppet
On Mon, Jan 24, 2011 at 5:17 PM, Nate McCall n...@riptano.com wrote: Might be a bit out of date, but this one is useful: https://github.com/cmceniry/cassandrapuppet On Mon, Jan 24, 2011 at 3:51 PM, Aaron Morton aa...@thelastpickle.com wrote: Is anyone using puppet http://www.puppetlabs.com/ to deploy / manage cassandra ? Has anyone used this module https://github.com/plathrop/puppet-module-cassandra or using it for backups http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/cassandra_backup_is_a_snap or know of any other resources ? Cheers Aaron I notice you found my blog. This article is much more detailed. http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/easy_street_deploying_cassandra_via
Re: cassandra as session store
On Tue, Feb 1, 2011 at 12:57 PM, Anthony John chirayit...@gmail.com wrote: Not a concern - and here is why: From the wiki arch section captioned below - eventual consistency does not have to mean inconsistent reads. The concern is the overhead for consistent reads. But remember in the use case being cited, the expensive read will happen only during failover, not all the time. More specifically:
R = read replica count
W = write replica count
N = replication factor
Q = QUORUM (Q = N / 2 + 1)
If W + R > N, you will have consistency:
W=1, R=N
W=N, R=1
W=Q, R=Q where Q = N / 2 + 1
On Tue, Feb 1, 2011 at 11:47 AM, Tong Zhu tong@rms.com wrote: The problem is where to store the session data. If the session needs to be accessible by more than one web server, external storage is needed. Cassandra only supports eventual consistency. If web server w1 saves the session at node 1 of Cassandra while web server w2 retrieves the session from a different node, and these two requests are close enough in time, there is a chance that what w2 retrieved is different from what w1 saved. Is it a concern? Tong -Original Message- From: buddhasystem [mailto:potek...@bnl.gov] Sent: Tuesday, February 01, 2011 9:42 AM To: cassandra-u...@incubator.apache.org Subject: Re: cassandra as session store Most if not all modern web application frameworks support sessions. This applies to Django (with which I have most experience and also run it with X.509 security layer) but also to Ruby on Rails and Pylons. So, why would you re-invent the wheel? Too messy. It's all out there for you to use. Regards, Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/cassandra-as-session-store-tp5981871p5981961.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. This message and any attachments contain information that may be RMS Inc. confidential and/or privileged.
If you are not the intended recipient (or authorized to receive for the intended recipient), and have received this message in error, any use, disclosure or distribution is strictly prohibited. If you have received this message in error, please notify the sender immediately by replying to the e-mail and permanently deleting the message from your computer and/or storage system. Ah. Eventual Consistency! Mama no! RUN! From the JSR-000315 Java Servlet 3.0 Final Release documentation, Distributed Environments: Within an application marked as distributable, all requests that are part of a session must be handled by one JVM at a time. The container must be able to handle all objects placed into instances of the HttpSession class using the setAttribute or putValue methods appropriately. The following restrictions are imposed to meet these conditions: This looks to be the responsibility of the web cluster, not the backend, to ensure serialized access. (At least that is how I am reading it.)
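The W + R > N rule quoted earlier in this thread is easy to check mechanically. The sketch below is just that arithmetic, with function names (`quorum`, `reads_see_latest_write`) chosen for this example; it says nothing about latency or availability, only whether every read set must overlap every write set.

```python
def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1

def reads_see_latest_write(w: int, r: int, n: int) -> bool:
    """True when every read set overlaps every write set (W + R > N)."""
    return w + r > n

N = 3
for w, r in [(1, N), (N, 1), (quorum(N), quorum(N)), (1, 1)]:
    print(f"W={w} R={r} N={N}: consistent={reads_see_latest_write(w, r, N)}")
```

With N=3, the three combinations listed in the thread all satisfy the rule, while W=1, R=1 does not, which is the case Tong is worried about for session data.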
Re: How to delete bulk data from cassandra 0.6.3
On Sat, Feb 5, 2011 at 4:12 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote: Any update on this? On 02/05/2011 12:53 AM, Ali Ahsan wrote: So do we need to write a script? Or is it something I can do as a system admin without involving any developer? If yes, please guide me in this case. On 02/04/2011 10:36 PM, Jonathan Ellis wrote: In that case, you should shut down the server before removing data files. On Fri, Feb 4, 2011 at 9:01 AM, roshandawr...@gmail.com wrote: I thought truncate() was not available before 0.7 (in 0.6.3), was it? --- Sent from BlackBerry -Original Message- From: Jonathan Ellis jbel...@gmail.com Date: Fri, 4 Feb 2011 08:58:35 To: user user@cassandra.apache.org Reply-To: user@cassandra.apache.org Subject: Re: How to delete bulk data from cassandra 0.6.3 You should use truncate instead. (Then remove the snapshot truncate creates.) On Fri, Feb 4, 2011 at 2:05 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote: Hi All, Is there any way I can delete column families' data (not removing the column families) from Cassandra without affecting ring integrity? What if I delete some column families' data in Linux with the rm command? -- S.Ali Ahsan Senior System Engineer e-Business (Pvt) Ltd 49-C Jail Road, Lahore, P.O. Box 676 Lahore 54000, Pakistan Tel: +92 (0)42 3758 7140 Ext. 128 Mobile: +92 (0)345 831 8769 Fax: +92 (0)42 3758 0027 Email: ali.ah...@panasiangroup.com www.ebusiness-pg.com www.panasiangroup.com Confidentiality: This e-mail and any attachments may be confidential and/or privileged. If you are not a named recipient, please notify the sender immediately and do not disclose the contents to another person use it for any purpose or store or copy the information in any medium. Internet communications cannot be guaranteed to be timely, secure, error or virus-free. We do not accept liability for any errors or omissions.
-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com In 0.6.X: stop Cassandra (kill <pid of cassandra>), remove the data files of the column family (rm -rf /var/lib/cassandra/data/keyspace/<CF you want to delete>*), then start Cassandra again.
Re: How to delete bulk data from cassandra 0.6.3
On Sat, Feb 5, 2011 at 11:35 AM, Ali Ahsan ali.ah...@panasiangroup.com wrote: Thanks for replying, Edward Capriolo. Will this affect Cassandra ring integrity? Another question: will Cassandra work properly after this operation? And will it be possible to restore the deleted data from backup? In 0.6.X: stop Cassandra (kill <pid of cassandra>), remove the data files of the column family (rm -rf /var/lib/cassandra/data/keyspace/<CF you want to delete>*), then start Cassandra again. -- S.Ali Ahsan I am not sure what you mean by data integrity. In short, when Cassandra starts up it searches its data directories and loads up the data, index, bloom filters, and saved caches it finds. Unless the files are corrupt, it will happily load up what it finds. Restores are done by the process you described: stop the server, restore the files, start the server.
Re: How bad is the impact of compaction on performance?
On Sat, Feb 5, 2011 at 11:59 AM, buddhasystem potek...@bnl.gov wrote: Just wanted to see if someone with experience in running an actual service can advise me: how often do you run nodetool compact on your nodes? Do you stagger it in time, for each node? How badly is performance affected? I know this all seems too generic but then again no two clusters are created equal anyhow. Just wanted to get a feel. Thanks, Maxim This is an interesting topic. Cassandra can now remove tombstones on non-major compaction. For some use cases you may not have to trigger nodetool compact yourself to remove tombstones. Use cases that do not do many updates or deletes may have the least need to run compaction manually. However, if you have smaller SSTables, or fewer SSTables, your read operations will be more efficient. If you have downtime, such as from 1AM-6AM, going through a major compaction might shrink your dataset significantly, and that will make reads better. Compaction can be more or less intensive. The largest factor is row size. Users with large rows probably see faster compaction, while smaller rows see it take a long time. You can lower the priority of the compaction thread for experimentation. As to performance, you want to get your cluster to the state where it is not compacting often. This may mean you need more nodes to handle writes. I graph the compaction information from JMX http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp to get a feel for how often a node is compacting on average. I also cross-reference the compaction with the read latency and IO graphs I have to see what impact compaction has on reads.
Forcing a major compaction also lowers the chances that a compaction will happen during the day at peak time. I major compact a few cluster nodes each night through cron (gc time 3 days). This has been good for keeping our data on disk as small as possible. Forcing the major compaction at night uses IO, but I find it saves IO over the course of the day because each read seeks less on disk.
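One way to picture "a few cluster nodes each night" is a simple rotation over the node list. The sketch below is an assumption about how such a schedule could be laid out, not the cron setup actually used in this thread; the function names and host names are made up.

```python
def nightly_batches(nodes, per_night):
    """Split nodes into consecutive groups, one group per night."""
    return [nodes[i:i + per_night] for i in range(0, len(nodes), per_night)]

def nodes_for_night(nodes, per_night, night_number):
    """Rotate through the groups so every node is compacted once per cycle."""
    batches = nightly_batches(nodes, per_night)
    return batches[night_number % len(batches)]

cluster = ["node1", "node2", "node3", "node4", "node5"]
for night in range(4):
    print(night, nodes_for_night(cluster, 2, night))
```

A cron job on each night could then run nodetool compact only against the hosts returned for that night, so that only a small part of the cluster is compacting at any given time.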
Re: How bad is the impact of compaction on performance?
On Sat, Feb 5, 2011 at 12:48 PM, buddhasystem potek...@bnl.gov wrote: Thanks Edward. In our usage scenario, there is never downtime, it's a global 24/7 operation. What is impacted the worst, the read or write? How does a node handle compaction when there is a spike of writes coming to it? Edward Capriolo wrote: On Sat, Feb 5, 2011 at 11:59 AM, buddhasystem potek...@bnl.gov wrote: Just wanted to see if someone with experience in running an actual service can advise me: how often do you run nodetool compact on your nodes? Do you stagger it in time, for each node? How badly is performance affected? I know this all seems too generic but then again no two clusters are created equal anyhow. Just wanted to get a feel. Thanks, Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-bad-is-teh-impact-of-compaction-on-performance-tp5995868p5995868.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. This is an interesting topic. Cassandra can now remove tombstones on non-major compaction. For some use cases you may not have to trigger nodetool compact yourself to remove tombstones. Use cases that do not to many updates, deletes may have the least need to run compaction yourself. !However! If you have smaller SSTables, or less SSTables your read operations will be more efficient. if you have downtime such as from 1AM-6AM. Going through a major compaction might shrink you dataset significantly and that will make reads better. Compaction can be more or less intensive. The largest factor is is row size. Users with large rows probably see faster compaction while smaller rows see it take a long time. You can lower the priority of the compaction thread for experimentation. As to the performance you want to get your cluster to the state where it is not compacting often. This may mean you need more nodes to handle writes. 
I graph the compaction information from JMX http://www.jointhegrid.com/cassandra/cassandra-cacti-m6.jsp to get a feel for how often a node is compacting on average. Also I cross reference the compaction with Read latency and IO graphs I have to see what impact compaction has on reads. Forcing a major compaction also lowers the chances a compaction will happen during the day on peak time. I major compact a few cluster nodes each night through cron (gc time 3 days). This has been good for keeping our data on disk as small as possible. Forcing the major compact at night uses IO, but I find it saves IO over the course of the day because each read seeks less on disk. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-bad-is-the-impact-of-compaction-on-performance-tp5995868p5995978.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. It does not have to be downtime. It just has to be a slow time. Use your traffic graphs to run major compact at the slowest time so it has the least impact on performance. Compaction does not generally affect writes or bursts of writes, especially if your writes go to a separate commit log disk. In the best case scenario compaction may not affect your performance at all. An example of this would be a use case where nearly 100% of reads are serviced by the row cache, so disk is not a factor. Generally speaking, if you have good fast hard disks and only a single node is compacting at a given time, the cluster absorbs this. In 0.7.0 the dynamic snitch should help re-route traffic away from slower nodes for even less impact. In other words, making compaction non-impacting is all about capacity.
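The "major compact a few nodes each night through cron" routine described above could be sketched roughly as follows. This is only an illustration, not the poster's actual script: the host names, the group size, and the helper names are made up, and the JMX port of 8080 is an assumption. The only things taken from the thread are the nodetool invocation shape (`nodetool -h <host> -p <port> <command>`) and the `compact` command.

```python
import datetime


def nodes_for_night(nodes, per_night, day):
    """Pick the group of nodes whose turn it is on the given date.

    Nodes are split into fixed groups of `per_night`; the date's ordinal
    selects one group, so consecutive nights walk through every group.
    """
    if not nodes or per_night <= 0:
        return []
    groups = [nodes[i:i + per_night] for i in range(0, len(nodes), per_night)]
    return groups[day.toordinal() % len(groups)]


def compact_commands(nodes, per_night, day, jmx_port=8080):
    """Build one 'nodetool compact' command line per node due tonight."""
    return [["nodetool", "-h", host, "-p", str(jmx_port), "compact"]
            for host in nodes_for_night(nodes, per_night, day)]


if __name__ == "__main__":
    # Hypothetical host names; a real cron job would run these commands
    # (for example with subprocess.run) instead of printing them.
    cluster = ["cass1", "cass2", "cass3", "cass4", "cass5", "cass6"]
    for cmd in compact_commands(cluster, 2, datetime.date.today()):
        print(" ".join(cmd))
```

Run from cron at the quietest hour shown on your traffic graphs, this keeps only a small group of nodes compacting on any given night.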
Re: Finding the intersection results of column sets of two rows
On Sun, Feb 6, 2011 at 10:15 AM, buddhasystem potek...@bnl.gov wrote: Hello, If the amount of data is _that_ small, you'll have a much easier life with MySQL, which supports the join procedure -- because that's exactly what you want to achieve. asil klin wrote: Hi all, I want to procure the intersection of columns set of two rows (from 2 different column families). To achieve the intersection results, Can I, first retrieve all columns(around 300) from first row and just query by those column names in the second row(which contains maximum 100 000 columns) ? I am using the results during the write time not before presentation to the user, so latency wont be much concern while writing. Is it the proper way to procure intersection results of two rows ? Would love to hear your comments.. - Regards, Asil -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Finding-the-intersection-results-of-column-sets-of-two-rows-tp5997248p5997743.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. You can use multi-get when fetching lists of already known keys to optimize your round trip time.
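The approach asked about above (read the roughly 300 column names from the first row, then request only those names from the second row) can be sketched as below. Plain dicts stand in for the two rows, and `fetch_by_names` is a hypothetical stand-in for whatever "get these named columns" call your client library offers; none of these names come from a real Cassandra client API.

```python
def make_fetcher(row):
    """Return a function that mimics a 'get columns by name' call on one row.

    `row` is a plain dict of column name -> value, standing in for the
    large second row stored in the cluster.
    """
    def fetch_by_names(names):
        return {name: row[name] for name in names if name in row}
    return fetch_by_names


def intersect_columns(small_row, fetch_by_names):
    """Intersect two rows by asking the big row only for the small row's names."""
    names = list(small_row)
    found = fetch_by_names(names)  # one request carrying ~300 names
    return {name: (small_row[name], found[name])
            for name in names if name in found}


if __name__ == "__main__":
    first_row = {"alice": 1, "bob": 2, "carol": 3}
    second_row = {"bob": 20, "carol": 30, "zed": 0}
    print(intersect_columns(first_row, make_fetcher(second_row)))
```

The point of the design is that only the small row is read in full; the large row is never scanned, it is only asked for specific names.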
Re: Cassandra memory consumption
On Tue, Feb 8, 2011 at 4:56 PM, Victor Kabdebon victor.kabde...@gmail.com wrote: I will do that in the future and I will post my results here ( I upgraded the server to debian 6 to see if there is any change, so memory is back to normal). I will report in a few days. In the meantime I am open to any suggestion... 2011/2/8 Aaron Morton aa...@thelastpickle.com When you attach to the JVM with JConsole how much non heap memory and how much heap memory is reported on the memory tab? Xmx controls the total size of the heap memory, which excludes the permanent generation. see http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html#generation_sizing and http://blogs.sun.com/jonthecollector/entry/presenting_the_permanent_generation Total non-heap memory on a 0.7 box I have is around 27M. Your numbers seem large but it would be interesting to know what the JVM is reporting. Aaron On 09 Feb, 2011, at 05:57 AM, Victor Kabdebon victor.kabde...@gmail.com wrote: Information on the system : Debian 5 Jvm : victor@testhost:~/database/apache-cassandra-0.6.6$ java -version java version 1.6.0_22 Java(TM) SE Runtime Environment (build 1.6.0_22-b04) Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode) RAM : 2Go 2011/2/8 Victor Kabdebon victor.kabde...@gmail.com Sorry Jonathan : So most of this information was taken using the command : sudo ps aux | grep cassandra For the nodetool information it is : /bin/nodetool --host localhost --port 8081 info Regards, Victor K. 2011/2/8 Jonathan Ellis jbel...@gmail.com I missed the part where you explained where you're getting your numbers from. On Tue, Feb 8, 2011 at 9:32 AM, Victor Kabdebon victor.kabde...@gmail.com wrote: It is really weird that I am the only one to have this issue.
I restarted Cassandra today and already the memory consumption is over the limit : root 1739 4.0 24.5 664968 494996 pts/4 SLl 15:51 0:12 /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -cp bin/../conf:bin/../build/classes:bin/../lib/antlr-3.1.3.jar:bin/../lib/apache-cassandra-0.6.6.jar:bin/../lib/avro-1.2.0-dev.jar:bin/../lib/cassandra-javautils.jar:bin/../lib/clhm-production.jar:bin/../lib/commons-cli-1.1.jar:bin/../lib/commons-codec-1.2.jar:bin/../lib/commons-collections-3.2.1.jar:bin/../lib/commons-io-1.4.jar:bin/../lib/commons-lang-2.4.jar:bin/../lib/commons-pool-1.5.4.jar:bin/../lib/google-collections-1.0.jar:bin/../lib/hadoop-core-0.20.1.jar:bin/../lib/hector-0.6.0-14.jar:bin/../lib/high-scale-lib.jar:bin/../lib/ivy-2.1.0.jar:bin/../lib/jackson-core-asl-1.4.0.jar:bin/../lib/jackson-mapper-asl-1.4.0.jar:bin/../lib/jline-0.9.94.jar:bin/../lib/jna.jar:bin/../lib/json-simple-1.1.jar:bin/../lib/libthrift-r917130.jar:bin/../lib/log4j-1.2.14.jar:bin/../lib/perf4j-0.9.12.jar:bin/../lib/slf4j-api-1.5.8.jar:bin/../lib/slf4j-log4j12-1.5.8.jar:bin/../lib/uuid-3.1.jar org.apache.cassandra.thrift.CassandraDaemon It is really an annoying problem if we cannot really foresee memory consumption. Best regards, Victor K 2011/2/8 Victor Kabdebon victor.kabde...@gmail.com Dear all, Sorry to come back again to this point but I am really worried about Cassandra memory consumption. I have a single machine that runs one Cassandra server. There is almost no data on it but I see a crazy memory consumption and it doesn't care at all about the instructions...
Note that I am not using mmap, but Standard, I use also JNA (inside lib folder), I am running on debian 5 64 bits, so a pretty normal configuration. I also use Cassandra 0.6.8. Here is the information I gathered on Cassandra : 105 16765 0.1 34.1 1089424 687476 ? Sl Feb02 14:58 /usr/bin/java -ea -Xms128M -Xmx256M -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -Dcom.sun.management.jmxremote.port=8081 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dstorage-config=bin/../conf -Dcassandra-foreground=yes -cp
Re: Specifying row caching on per query basis ?
On Wed, Feb 9, 2011 at 2:43 PM, Ertio Lew ertio...@gmail.com wrote: Is this under consideration for future releases ? or being thought about!? On Thu, Feb 10, 2011 at 12:56 AM, Jonathan Ellis jbel...@gmail.com wrote: Currently there is not. On Wed, Feb 9, 2011 at 12:04 PM, Ertio Lew ertio...@gmail.com wrote: Is there any way to specify on per query basis(like we specify the Consistency level), what rows be cached while you're reading them, from a row_cache enabled CF. I believe, this could lead to much more efficient use of the cache space!!( if you use same data for different features/ parts in your application which have different caching needs). -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com I have described a suggested implementation in this issue: https://issues.apache.org/jira/browse/CASSANDRA-2035
Re: Default Listen Port
On Wed, Feb 9, 2011 at 4:00 PM, jeremy.truel...@barclayscapital.com wrote: What’s the easiest way to change the port nodes listen for comm on from other nodes? It appears that the default is 8080 which collides with my tomcat server on one of our dev boxes. I tried doing something in cassandra.yaml like listen_address: 192.1.fake.2: but that doesn’t work it throws an exception. Also can you not put the actual name of servers in the config or does it always have to be the actual ip address currently? Thanks. jt ___ This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure. If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing. Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. This email may relate to or be sent from other members of the Barclays Group. ___ You are having a collision on 8080, which is the default JMX port. In conf/cassandra-env.sh look for JMX_PORT=8080. 9160 is the thrift port used by clients; 7000 is the storage port (used between nodes). If you change the JMX port you have to specify it when using nodetool: 'nodetool -h localhost -p <new port> ring'
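Before starting a node on a shared dev box, it can help to check which of the three ports mentioned above are already taken. The sketch below is an illustration only; the port numbers are the ones named in the reply (8080 JMX, 9160 thrift, 7000 storage), while the function names and the dictionary are made up for this example.

```python
import socket

# Ports as named in the reply above; adjust if your configuration differs.
DEFAULT_PORTS = {"jmx": 8080, "thrift": 9160, "storage": 7000}


def port_in_use(port, host="127.0.0.1", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=DEFAULT_PORTS, host="127.0.0.1"):
    """Return the subset of named ports that already have a listener."""
    return {name: port for name, port in ports.items()
            if port_in_use(port, host)}


if __name__ == "__main__":
    for name, port in busy_ports().items():
        print("port %d (%s) is already in use" % (port, name))
```

If the JMX port shows up as busy (Tomcat in the case above), change JMX_PORT in conf/cassandra-env.sh and pass the new value to nodetool with -p.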
Re: Is Avro still supported?
https://issues.apache.org/jira/browse/CASSANDRA-926 On Sat, Feb 12, 2011 at 8:27 AM, Joshua Partogi joshua.j...@gmail.com wrote: Hi, I saw in the latest source in trunk, avro codes has been deleted. Does this mean Avro is not supported anymore? If so, what was the decision behind dropping the support for Avro? Thanks -- http://twitter.com/jpartogi
Re: Does Cassandra support multiple listen_address and rpc_address?
On Sun, Feb 13, 2011 at 1:39 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: multiple network paths for inner-cluster communication will boost performance Thanks. Xiaobo Gu No. Each node has a single IP. You can boost performance in a similar way with Ethernet bonding, or 10G
Re: consistency question
On Tue, Feb 15, 2011 at 3:59 AM, Serdar Irmak sir...@protel.com.tr wrote: Hi, In a 3 node setup (named A,B,C) with replication factor 3 and quorum read/write scenario; suppose a new value of data X is written to A and B but not C for any reason, then A went down and I fired D with the data of C or with empty data, where in either case X is not present in D. Then when I read quorum, nodes C and D responded and gave me the old value (then read repair in background). So doesn’t it mean there is no consistency with quorum, too ? My best Serdar The consistency rules do NOT apply if you introduce a new node without properly bootstrapping it. If you have A,B,C and A fails you should 1) 'nodetool removetoken A'. 2) Start node D with auto_bootstrap=true. You can start a node empty (with auto_bootstrap=false) using quorum/quorum, but if you do not 'nodetool repair' it before another node fails you end up with the situation you described. Edward
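The scenario above can be worked through with a small in-memory model. This is a toy sketch, not Cassandra code: replicas are dict entries holding a (timestamp, value) pair, and all function names are invented for the illustration. It shows why quorum reads and writes overlap on a healthy ring, and how swapping in an empty node D without bootstrap or repair breaks that overlap.

```python
def quorum(rf):
    """Number of replicas that make up a quorum for replication factor rf."""
    return rf // 2 + 1


def overlaps(rf, write_replicas, read_replicas):
    """True if every read set must share at least one replica with every write set."""
    return write_replicas + read_replicas > rf


def read_newest(replicas, contacted):
    """Return the value with the highest timestamp among contacted replicas."""
    answers = [replicas[n] for n in contacted if replicas[n] is not None]
    return max(answers)[1] if answers else None


if __name__ == "__main__":
    rf = 3
    print(quorum(rf), overlaps(rf, quorum(rf), quorum(rf)))  # 2 True

    # New value written at quorum: it reached A and B, but not C.
    replicas = {"A": (2, "new"), "B": (2, "new"), "C": (1, "old")}
    print(read_newest(replicas, ["B", "C"]))  # any two replicas include A or B

    # A dies and is replaced by an empty D that was never bootstrapped/repaired.
    del replicas["A"]
    replicas["D"] = None
    print(read_newest(replicas, ["C", "D"]))  # stale: only B holds the new value
```

The write was acknowledged by two replicas, but replacing A with an empty node silently reduces that to one, so the W + R > RF guarantee no longer holds until D is repaired.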
Re: What is the most solid version of Cassandra? No secondary indexes needed.
On Tue, Feb 15, 2011 at 3:03 PM, buddhasystem potek...@bnl.gov wrote: Thank you! It's just that 7.1 seems the bleeding edge now (a serious bug fixed today). Would you still trust it as a production-level service? I'm just slightly concerned. I don't want to create a perception among our IT that the product is not ready for prime time. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/What-is-the-most-solid-version-of-Cassandra-No-secondary-indexes-needed-tp6028966p6029047.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. You are not going to want to go through the 0.6.X API to 0.7.0 API migration. I am still happily running 0.6.8, but I know I need the features in 0.7.X. If I were starting today I would go with the 0.7.X branch and be ready to do some minor updates in the next couple of months.
Re: Replica details
On Thu, Feb 17, 2011 at 1:41 PM, A J s5a...@gmail.com wrote: Where can I get good detailed explanation of the various replication options (Simple, Old Network and Network) along with snitches. I did read the definitive guide but not really satisfied. Is there a good post somewhere explaining this ? I will have 4 datacenters (assume) and 3 nodes in each DC. I wish to have one and only one copy of complete database in each DC. I wish to understand how will the ring placement look like. Thanks. AJ I hate to break this to you, but the Definitive Guide probably has the best information (including diagrams) out there (that I know of). Because of all the possible permutations of multi-datacenter setups it is going to be difficult to find some doc/presentation that describes EXACTLY what you want to do and how it will work. Here are some hints: You can set up a simulation cluster. Give each host IPs such as 127.0.0.1 and 127.0.0.2, since you do not have to explicitly configure those on a single host. Set the Xmx low for each instance. Run in foreground and/or set your logging to verbose so you can debug which nodes data lands on.
Re: Does servers with different capacities in a cluster affect the overall performance?
On Tue, Feb 22, 2011 at 5:13 AM, XiaoboGu guxiaobo1...@gmail.com wrote: I mean servers with different CPU cores ,memory, or disk space, does Cassandra allow this kind of configuration? This is allowed but managing this may be more difficult in production. Most settings are applied globally at the column family level, such as memtable_flush_mb for example. This means that you will never be able to get tuning settings perfect because you will always have to take a middle ground approach. Moreover the Random partitioner works best when each node has an equal share of data. An unbalanced ring is the enemy because nodes with more data see more requests, and each request has to work through more data. Thus unbalanced nodes typically become the ones that start showing performance issues first. It also becomes really difficult to diagnose performance issues with an increasing number of variables (this node has 2x data but 4x the ram of node X, and 30% the processing power of node Y.) Short of suggesting hardware, I hint that 1U's and Blades are good platforms over big iron because scale out is less difficult than scale up. Drastically mismatched hardware is something I would avoid.
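To make "each node has an equal share of data" concrete, here is a small sketch of how evenly spaced tokens can be computed for the Random partitioner, and how uneven tokens translate into uneven ownership. It assumes the Random partitioner's token space runs from 0 to 2**127; the function names are invented for this illustration.

```python
RING_SIZE = 2 ** 127  # assumed token space of the Random partitioner


def balanced_tokens(node_count):
    """Evenly spaced initial tokens, one per node."""
    return [i * RING_SIZE // node_count for i in range(node_count)]


def ownership(tokens):
    """Fraction of the ring owned by each token, in sorted token order.

    A node owns the range from the previous token (exclusive) up to its
    own token (inclusive), wrapping around at the end of the ring.
    """
    ordered = sorted(tokens)
    if len(ordered) == 1:
        return [1.0]
    return [((tok - ordered[i - 1]) % RING_SIZE) / RING_SIZE
            for i, tok in enumerate(ordered)]


if __name__ == "__main__":
    print(ownership(balanced_tokens(4)))      # four equal shares
    print(ownership([0, RING_SIZE // 4]))     # one node carries three quarters
```

In the second case one node holds three times the data of the other, which is the kind of imbalance that shows up first as a performance problem.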
Re: Distribution Factor: part of the solution to many-CF problem?
On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn da...@lookin2.com wrote: No, that's not what I mean at all. That message is about the ability to use different partitioners for different CFs, say, RandomPartitioner for one, OPP for another. I'm talking about defining how many nodes a CF should be distributed over, which would be useful if you have a lot of nodes and a lot of small CFs (small relative to the total amount of data). On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton aa...@thelastpickle.com wrote: Sounds a bit like this idea http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html Aaron On 22/02/2011, at 1:28 AM, David Boxenhorn da...@lookin2.com wrote: Cassandra is both distributed and replicated. We have Replication Factor but no Distribution Factor! Distribution Factor would define over how many nodes a CF should be distributed. Say you want to support millions of multi-tenant users in clusters with thousands of nodes, where you don't know the user's schema in advance, so you can't have users share CFs. In this case you wouldn't want to spread out each user's Column Families over thousands of nodes! You would want something like: RF=3, DF=10 i.e. distribute each CF over 10 nodes, within those nodes replicate 3 times. One implementation of DF would be to hash the CF name, and use the same strategies defined for RF to choose the N nodes in DF=N. The single partitioner is baked in. Here is a possible solution: use OPP, but md5 hash your keys client side. This solves that, but when you have keyspaces using OPP but with different key distributions this falls apart.
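The "md5 hash your keys client side" idea could look roughly like the sketch below. This is one possible reading of the suggestion, not a recipe from the thread: the choice to prefix the digest to the original key (so the raw key stays recoverable) and the function names are assumptions made for the example.

```python
import hashlib


def hashed_key(raw_key):
    """Prefix a key with its md5 hex digest so keys spread evenly under OPP.

    The original key is kept after a ':' separator so it can be recovered
    when rows are read back.
    """
    digest = hashlib.md5(raw_key.encode("utf-8")).hexdigest()
    return digest + ":" + raw_key


def original_key(stored_key):
    """Strip the 32-character digest prefix and separator."""
    return stored_key[33:]


if __name__ == "__main__":
    for key in ["user1", "user2", "user3"]:
        print(hashed_key(key))
```

Keys that were adjacent (user1, user2, user3) now sort far apart, which spreads load across the ring; the cost is that range scans in natural key order are no longer possible for the column families keyed this way.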
Re: Distribution Factor: part of the solution to many-CF problem?
On Tue, Feb 22, 2011 at 2:49 PM, Aaron Morton aa...@thelastpickle.com wrote: The single partitioner is baked in That was my point. You could perhaps write a partitioner that considers the CF when deciding what nodes to put data on. Off the top of my head the partitioner is not told about the CF the key is storing in. Aaron On 23/02/2011, at 6:01 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Mon, Feb 21, 2011 at 5:14 PM, David Boxenhorn da...@lookin2.com wrote: No, that's not what I mean at all. That message is about the ability to use different partitioners for different CFs, say, RandomPartitioner for one, OPP for another. I'm talking about defining how many nodes a CF should be distributed over, which would be useful if you have a lot of nodes and a lot of small CFs (small relative to the total amount of data). On Mon, Feb 21, 2011 at 9:58 PM, Aaron Morton aa...@thelastpickle.com wrote: Sounds a bit like this idea http://www.mail-archive.com/dev@cassandra.apache.org/msg01799.html Aaron On 22/02/2011, at 1:28 AM, David Boxenhorn da...@lookin2.com wrote: Cassandra is both distributed and replicated. We have Replication Factor but no Distribution Factor! Distribution Factor would define over how many nodes a CF should be distributed. Say you want to support millions of multi-tenant users in clusters with thousands of nodes, where you don't know the user's schema in advance, so you can't have users share CFs. In this case you wouldn't want to spread out each user's Column Families over thousands of nodes! You would want something like: RF=3, DF=10 i.e. distribute each CF over 10 nodes, within those nodes replicate 3 times. One implementation of DF would be to hash the CF name, and use the same strategies defined for RF to choose the N nodes in DF=N. The single partitioner is baked in. Here is a possible solution: use OPP, but md5 hash your keys client side.
This solves that, but when you have keyspaces using OPP but with different key distributions this falls apart. Not to say that this is a bad idea, but it breaks the #1 law of Cassandra: keep everything balanced. The routine that calculates natural endpoints does not take the CF into account. Regarding multi-tenancy, I do not think there is a line in the sand between running N clusters and multi-tenancy. Multi-tenancy is also ambiguous, like "real time". Does multi-tenancy mean efficiently supporting 10-20 CFs or 20,000? I do not see the Cassandra code base supporting a very large number of CFs since it was designed around a low number of CFs! Some may have moved from an RDBMS background where a table looks/works like a column family, but that is probably not denormalized enough. Many in fact advocate "You only need 1 CF!"
Re: Multiple Seeds
On Wed, Feb 23, 2011 at 2:30 PM, jeremy.truel...@barclayscapital.com wrote: Yeah I set the tokens, I’m more asking if I start the first seed node with autobootstrap set to false the second seed should have it set to true as well as all the slave nodes correct? I didn’t see this in the docs but I may have just missed it. From: Eric Gilmore [mailto:e...@datastax.com] Sent: Wednesday, February 23, 2011 2:24 PM To: user@cassandra.apache.org Subject: Re: Multiple Seeds The DataStax documentation offers some answers to those questions in the Getting Started section and the Clustering reference docs. Autobootstrap should be true, but with the important caveat that initial_token values should be specified. Have a look at those docs, and please give feedback on how helpful they are/aren't. Regards, Eric Gilmore On Wed, Feb 23, 2011 at 11:15 AM, jeremy.truel...@barclayscapital.com wrote: What’s the best way to bring multiple seeds up, should only one of them have auto bootstrap set to true or should neither of them? Should they list themselves and the other seed in their seed section in the yaml config?
If a node is defined as a seed it will never auto bootstrap. After it has bootstrapped and has a system table you can set its yaml file as a seed if you wish.
Re: Multiple Seeds
On Wed, Feb 23, 2011 at 2:59 PM, jeremy.truel...@barclayscapital.com wrote: To add a host to the seeds list after it has had the data streamed to it I need to 1. stop it 2. edit the yaml file to a. include it in the seeds list b. set auto bootstrap to false 3. restart it correct? Additionally you would need to add it to the other nodes' seed lists and restart them as well. From: Eric Gilmore [mailto:e...@datastax.com] Sent: Wednesday, February 23, 2011 2:47 PM To: user@cassandra.apache.org Subject: Re: Multiple Seeds Well -- when you first bring a node into a ring, you will probably want to stream data to it with auto_bootstrap: true. If you want that node to be a seed, then add it to the seeds list AFTER it has joined the ring. I'd refer you to the Seed List and Autobootstrapping sections of the Getting Started guide, which contain the following blurbs: There is no strict rule to determine which hosts need to be listed as seeds, but all nodes in a cluster need the same seed list. For a production deployment, DataStax recommends two seeds per data center. An autobootstrapping node cannot have itself in the list of seeds nor can it contain an initial_token already claimed by another node. To add new seeds, autobootstrap the nodes first, and then configure them as seeds. On Wed, Feb 23, 2011 at 11:39 AM, jeremy.truel...@barclayscapital.com wrote: So all seeds should always be set to 'auto_bootstrap: false' in their .yaml file. -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Wednesday, February 23, 2011 2:36 PM To: user@cassandra.apache.org Cc: Truelove, Jeremy: IT (NYK) Subject: Re: Multiple Seeds On Wed, Feb 23, 2011 at 2:30 PM, jeremy.truel...@barclayscapital.com wrote: Yeah I set the tokens, I'm more asking if I start the first seed node with autobootstrap set to false the second seed should have it set to true as well as all the slave nodes correct? I didn't see this in the docs but I may have just missed it.
From: Eric Gilmore [mailto:e...@datastax.com] Sent: Wednesday, February 23, 2011 2:24 PM To: user@cassandra.apache.org Subject: Re: Multiple Seeds The DataStax documentation offers some answers to those questions in the Getting Started section and the Clustering reference docs. Autobootstrap should be true, but with the important caveat that initial_token values should be specified. Have a look at those docs, and please give feedback on how helpful they are/aren't. Regards, Eric Gilmore On Wed, Feb 23, 2011 at 11:15 AM, jeremy.truel...@barclayscapital.com wrote: What's the best way to bring multiple seeds up, should only one of them have auto bootstrap set to true or should neither of them? Should they list themselves and the other seed in their seed section in the yaml config? If a node is defined as a seed it will never auto bootstrap.
After it has bootstrapped and has a system table you can set its yaml file as a seed if you wish.
Re: Multiple Seeds
On Wed, Feb 23, 2011 at 3:28 PM, jeremy.truel...@barclayscapital.com wrote: So does cassandra monitor the config file for changes? If it doesn't how else would it know unless you restart you had added a new seed? -Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Wednesday, February 23, 2011 3:23 PM To: user@cassandra.apache.org Cc: Truelove, Jeremy: IT (NYK) Subject: Re: Multiple Seeds On Wed, Feb 23, 2011 at 2:59 PM, jeremy.truel...@barclayscapital.com wrote: To add a host to the seeds list after it has had the data streamed to it I need to 1. stop it 2. edit the yaml file to a. include it in the seeds list b. set auto bootstrap to false 3. restart it correct? Additionally you would need to add it to the other nodes' seed lists and restart them as well. From: Eric Gilmore [mailto:e...@datastax.com] Sent: Wednesday, February 23, 2011 2:47 PM To: user@cassandra.apache.org Subject: Re: Multiple Seeds Well -- when you first bring a node into a ring, you will probably want to stream data to it with auto_bootstrap: true. If you want that node to be a seed, then add it to the seeds list AFTER it has joined the ring. I'd refer you to the Seed List and Autobootstrapping sections of the Getting Started guide, which contain the following blurbs: There is no strict rule to determine which hosts need to be listed as seeds, but all nodes in a cluster need the same seed list. For a production deployment, DataStax recommends two seeds per data center. An autobootstrapping node cannot have itself in the list of seeds nor can it contain an initial_token already claimed by another node. To add new seeds, autobootstrap the nodes first, and then configure them as seeds. On Wed, Feb 23, 2011 at 11:39 AM, jeremy.truel...@barclayscapital.com wrote: So all seeds should always be set to 'auto_bootstrap: false' in their .yaml file.
-Original Message- From: Edward Capriolo [mailto:edlinuxg...@gmail.com] Sent: Wednesday, February 23, 2011 2:36 PM To: user@cassandra.apache.org Cc: Truelove, Jeremy: IT (NYK) Subject: Re: Multiple Seeds On Wed, Feb 23, 2011 at 2:30 PM, jeremy.truel...@barclayscapital.com wrote: Yeah I set the tokens, I'm more asking if I start the first seed node with autobootstrap set to false the second seed should have it set to true as well as all the slave nodes correct? I didn't see this in the docs but I may have just missed it. From: Eric Gilmore [mailto:e...@datastax.com] Sent: Wednesday, February 23, 2011 2:24 PM To: user@cassandra.apache.org Subject: Re: Multiple Seeds The DataStax documentation offers some answers to those questions in the Getting Started section and the Clustering reference docs. Autobootstrap should be true, but with the important caveat that initial_token values should be specified. Have a look at those docs, and please give feedback on how helpful they are/aren't. Regards, Eric Gilmore On Wed, Feb 23, 2011 at 11:15 AM, jeremy.truel...@barclayscapital.com wrote: What's the best way to bring multiple seeds up, should only one of them have auto bootstrap set to true or should neither of them? Should they list themselves and the other seed in their seed section in the yaml config?
If a node is defined as a seed it will never auto bootstrap. After it has bootstrapped and has a system table you can set its yaml file as a seed if you wish.
Re: Will the large datafile size affect the performance?
On Wed, Feb 23, 2011 at 4:51 PM, buddhasystem potek...@bnl.gov wrote: I know that theoretically it should not (apart from compaction issues), but maybe somebody has experience showing otherwise: My test cluster now has 250GB of data and will have 1.5TB in its reincarnation. If all these data is in a single CF -- will it cause read or write performance problems? Should I shard it? One advantage of splitting the data would be reducing the impact of compaction and repairs (or so I naively assume). TIA Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Will-the-large-datafile-size-affect-the-performance-tp6057991p6057991.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. http://wiki.apache.org/cassandra/LargeDataSetConsiderations By dividing your data you get the benefit of being able to apply different settings at the Column Family or keyspace level. For example you might have some batch data that you only want to replicate twice, or some small subset of data that needs to be read frequently and is highly cached. Also, as you said, having three smaller CFs helps you avoid a single very long running and intensive operation like repair or major compact. If you always need to read both CFs to satisfy your application it is not a good idea.
Re: New Chain for : Does Cassandra use vector clocks
On Wed, Feb 23, 2011 at 9:28 PM, Ritesh Tijoriwala tijoriwala.rit...@gmail.com wrote: I was about to ask what Anthony's latest post below captures - if we don't have vector clocks and no locking, how does Cassandra prevent/detect conflicts? This is somewhat related to the question I asked in my last post - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-does-Cassandra-handle-failure-during-synchronous-writes-td6055152.html Thanks, Ritesh On Wed, Feb 23, 2011 at 6:22 PM, Anthony John chirayit...@gmail.com wrote: Apologies: for some reason my response on the original mail keeps bouncing back, thus this new one! On the other hand, the same article says: For conditional writes to work, the condition must be evaluated at all update sites before the write can be allowed to succeed. This means that when doing such an update CL=ALL must be used. Sorry, but I am confused by that entire thread! Questions: 1. Does Cassandra implement any kind of data locking, at any granularity, whether it be row/colF/col? 2. If the answer to 1 above is NO, how does CL ALL prevent conflicts? Concurrent updates on exactly the same piece of data on different nodes can still mess each other up, right? -JA Cassandra does not provide any built-in locking. It cannot protect against lost updates caused by multiple independent clients reading and writing the same data. The Cages library handles locking externally and is really easy to use. http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
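To make the lost-update point concrete, here is a minimal sketch in plain Python. It is a toy model, not Cassandra code: the `Column` class and `lost_update_demo` function are made up for illustration, and the only behaviour modelled is "highest timestamp wins, no locking".

```python
class Column:
    """Toy column: a value plus the timestamp of the write that set it."""

    def __init__(self, value, timestamp):
        self.value = value
        self.timestamp = timestamp

    def write(self, value, timestamp):
        # Highest timestamp wins; there is no lock and no compare-and-set.
        if timestamp >= self.timestamp:
            self.value = value
            self.timestamp = timestamp


def lost_update_demo():
    counter = Column(value=10, timestamp=1)
    # Two independent clients read the same value ...
    seen_by_a = counter.value
    seen_by_b = counter.value
    # ... and each writes back its own increment.
    counter.write(seen_by_a + 1, timestamp=2)
    counter.write(seen_by_b + 1, timestamp=3)
    return counter.value


print(lost_update_demo())  # 11, not 12: client A's increment is lost
```

Both writes succeed, yet one increment disappears. That is the case an external lock (Cages or similar) is meant to cover.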
A simple script that creates multi node clusters on a single machine.
On the mailing list and IRC there are many questions about Cassandra internals. I understand where the questions are coming from, because it took me a while to get a grip on them. However, if you have a laptop with a decent amount of RAM (2 GB is enough for 3-5 nodes; 4 GB is better), you can kick up a multi-node cluster right on your laptop. Then you can test failure and eventual-consistency scenarios (insert to node A, kill node B, join node C) to your heart's content. http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/lauching_5_node_cassandra_clusters
Re: Fill disks more than 50%
On Wed, Feb 23, 2011 at 9:39 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Hi, Given that you have always-increasing key values (timestamps) and never delete and hardly ever overwrite data. If you want to minimize work on rebalancing and statically assign (new) token ranges to new nodes as you add them, so they always get the latest data. Let's say you add a new node each year to handle next year's data. In a scenario like this, could you with 0.7 safely fill disks significantly more than 50% and still manage things like repair/recovery of faulty nodes? Regards, Terje Since all your data for a day/month/year would sit on the same server, all your servers with old data would be idle and your servers with current data would be very busy. This is probably not a good way to go. There is a ticket open for 0.8 for efficient node moves and joins. It is already a lot better in 0.7. Pretend you did not see this (you can join nodes using rsync if you know some tricks) if you are really afraid of joins, which you really should not be. As for the 50% statement: in a worst-case scenario a major compaction will require double the disk size of your column family. So if you have more than one column family you do NOT need 50% overhead.
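A back-of-the-envelope sketch of that last point. The big assumption, which is mine and not something the thread states, is that only one column family goes through a major compaction at a time, so the free space you need equals the size of your largest CF. The function name and the sizes are made up.

```python
def max_safe_fill_fraction(cf_sizes_gb):
    """Fraction of the disk the data may occupy while still leaving room
    to rewrite the largest column family once (worst-case major compaction).

    ASSUMPTION: only one column family is major-compacted at a time.
    """
    total = sum(cf_sizes_gb)
    headroom = max(cf_sizes_gb)
    return total / (total + headroom)


print(max_safe_fill_fraction([300]))            # 0.5
print(max_safe_fill_fraction([100, 100, 100]))  # 0.75
```

With one 300 GB CF you are stuck at 50%. Split the same data into three equal CFs and, under the assumption above, the disk can be 75% full.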
Re: Fill disks more than 50%
On Thu, Feb 24, 2011 at 4:08 AM, Thibaut Britz thibaut.br...@trendiction.com wrote: Hi, How would you use rsync instead of repair in case of a node failure? Rsync all files from the data directories of the adjacent nodes (which are part of the quorum group) and then run a compaction, which will remove all the unneeded keys? Thanks, Thibaut
On Thu, Feb 24, 2011 at 4:22 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Wed, Feb 23, 2011 at 9:39 PM, Terje Marthinussen tmarthinus...@gmail.com wrote: Hi, Given that you have always-increasing key values (timestamps) and never delete and hardly ever overwrite data. If you want to minimize work on rebalancing and statically assign (new) token ranges to new nodes as you add them, so they always get the latest data. Let's say you add a new node each year to handle next year's data. In a scenario like this, could you with 0.7 safely fill disks significantly more than 50% and still manage things like repair/recovery of faulty nodes? Regards, Terje
Since all your data for a day/month/year would sit on the same server, all your servers with old data would be idle and your servers with current data would be very busy. This is probably not a good way to go. There is a ticket open for 0.8 for efficient node moves and joins. It is already a lot better in 0.7. Pretend you did not see this (you can join nodes using rsync if you know some tricks) if you are really afraid of joins, which you really should not be. As for the 50% statement: in a worst-case scenario a major compaction will require double the disk size of your column family. So if you have more than one column family you do NOT need 50% overhead.
@Thibaut Britz Caveat: using simple strategy. This works because Cassandra scans its data at startup and then serves what it finds. For a join, for example, you can rsync all the data from the node below/to the right of where the new node is joining. Then join without bootstrap, then run cleanup on both nodes. (Also, you have to shut down the first node so you do not have a lost-write scenario in the time between the rsync and the new node's startup.) It does not make as much sense for repair, because the data on a node will triple before you compact/clean it up.
@Terje I am suggesting that you probably want to rethink your scheme: partitioning by year is going to perform badly, since the old servers are going to be nothing more than expensive tape drives.
Re: New Chain for : Does Cassandra use vector clocks
On Thu, Feb 24, 2011 at 3:03 PM, A J s5a...@gmail.com wrote: Yes, that is difficult to digest, and one has to be sure the use case can afford it. Some other NoSQL databases deal with it differently (though I don't think any of them use atomic two-phase commit). MongoDB, for example, will ask you to read from the node you wrote to first (the primary node) unless you are OK with eventual consistency. If the write did not make it to a majority of the other nodes, it will be rolled back from the original primary when it comes up again as a secondary. In some cases, you could still serve either the new value (that was returned as failed) or the old one. But it is different from Cassandra in the sense that Cassandra will never roll back.
On Thu, Feb 24, 2011 at 2:47 PM, Anthony John chirayit...@gmail.com wrote: The leap of faith here is that an error does not mean a clean backing out to the prior state, as we are used to with databases. It means that the operation in error could have gone through partially. Again, this is not absolutely unfamiliar territory and can be dealt with. -JA
On Thu, Feb 24, 2011 at 1:16 PM, A J s5a...@gmail.com wrote: "but could be broken in case of a failed write" You can think of a scenario where R + W > N still leads to inconsistency even for successful writes. Say you keep W=1 and R=N. Let's say the one node where a write happened with success goes down before the write made it to the other N-1 nodes. Let's say it goes down for good and is unrecoverable. The only option is to build a new node from scratch from the other active nodes. This will lead to a write that was lost, and you will end up serving a stale copy of it. It is better to talk in terms of use cases and whether Cassandra will be a fit for them. Otherwise, unless you have W=R=N and fsync before each write commit, there will be scope for inconsistency.
On Thu, Feb 24, 2011 at 1:25 PM, Anthony John chirayit...@gmail.com wrote: I see the point - apologies for putting everyone through this! It was just militating against my mental model. In summary, here is my takeaway - simple stuff but, IMO, important to conclude this thread (I hope): 1. I was splitting hairs over a failed (partial) quorum write. Such an event should be immediately followed by the same write going over a connection to another node (potentially using the connection caches of client implementations) or by a read at CL ALL, because the write could have partially gone through. 2. Timestamps are used in determining the latest version (correcting the false impression I was propagating). Finally, the W + R > N statement for quorum CL holds, but could be broken in case of a failed write, as it is unsure whether the new value got written on any server or not. Is that a fair characterization? Bottom line - unlike a traditional DBMS, errors do not ensure automatic cleanup and rollback; app code has to follow up if immediate, and not eventual, consistency is desired. I made that leap in almost all cases, I think, except the case of a failed write. My bad, and I can live with this! Regards, -JA
On Thu, Feb 24, 2011 at 11:50 AM, Sylvain Lebresne sylv...@datastax.com wrote: On Thu, Feb 24, 2011 at 6:33 PM, Anthony John chirayit...@gmail.com wrote: Completely understand! All that I am quibbling over is whether a CL of quorum guarantees consistency or not. That is what the documentation says, right? If, for a CL of quorum read, the actual returned result depends on which node returns the read first, or on other more convoluted conditions, then a quorum read/write is not consistent, by any definition.
But that's the point. The definition of consistency we are talking about has no meaning if you consider only a quorum read. The definition (which is the de facto definition of consistency in 'eventually consistent') makes sense if we talk about a write followed by a read. And it considers a succeeding write followed by a succeeding read. And that is the statement the wiki is making. Honestly, we could debate forever on the definition of consistency and whatnot. Cassandra guarantees that if you do a (succeeding) write on W replicas and then a (succeeding) read on R replicas, and if R + W > N, then it is guaranteed that the read will see the preceding write. And this is what is called consistency in the context of eventual consistency (which is not the context of ACID). If this is not the definition of consistency you had in mind then, by all means, Cassandra probably doesn't guarantee that definition. But given that the paragraph preceding what you pasted states clearly that we are not talking about ACID consistency but eventual consistency, I don't think the wiki is making any unfair statement. That being said, the wiki may not always be as clear as it could be. But it's an editable wiki :) -- Sylvain
I can still use Cassandra, and will use it, luv
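If the R + W > N rule feels like hand-waving, it can be brute-forced for small clusters. The sketch below is plain Python with no Cassandra involved, and the function name is made up. It checks whether every possible set of R replicas shares at least one replica with every possible set of W replicas. It only covers successful writes and reads; the failed-write and lost-node cases discussed above are outside what it models.

```python
from itertools import combinations


def every_read_overlaps_every_write(n, w, r):
    """True if any R replicas out of N always include at least one of
    the W replicas that acknowledged a successful write."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


for n, w, r in [(3, 2, 2), (3, 1, 3), (3, 1, 2), (3, 1, 1)]:
    print(n, w, r, every_read_overlaps_every_write(n, w, r))
```

The first two combinations (quorum/quorum and ONE/ALL) always overlap. The last two have R + W <= N, and there are read sets that miss the written replica entirely.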
Re: Understanding Indexes
On Thu, Feb 24, 2011 at 3:34 PM, mcasandra mohitanch...@gmail.com wrote: I wasn't aware that there is an index on primary key (that is row keys). So from what I understand there is by default an index on for eg: , in below example? Where can I read more about it? UserProfile = { // this is a ColumnFamily { // this is the key to this Row inside the CF // now we have an infinite # of columns in this row username: phatduckk, email: [hidden email], phone: (900) 976- }, // end row { // this is the key to another row in the CF // now we have another infinite # of columns in this row username: ieure, email: [hidden email], phone: (888) 555-1212 age: 66, gender: undecided }, } -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061857.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. Dude! You are running before you can walk. Why are you worried about secondary indexes before you know what the primary index is? :) http://wiki.apache.org/cassandra/ArchitectureOverview http://wiki.apache.org/cassandra/ArchitectureSSTable
Re: Understanding Indexes
On Thu, Feb 24, 2011 at 3:55 PM, mcasandra mohitanch...@gmail.com wrote: Either I am not explaning properly or I don't understand the data model just yet. Please check again: In below example this is what I understand: 1) UserProfile is a CF 2) is a row key 3) username is a column. Each row (eg ) has username column My understanding is that secondary indexes can be created only on column value. Which means I can create secondary index only on username, email etc. not on . is the row key, but you keep saying that I need secondary index, but I am actually asking about index on the row key. Is my understanding incorrect about this? UserProfile = { // this is a ColumnFamily { // this is the key to this Row inside the CF // now we have an infinite # of columns in this row username: phatduckk, email: [hidden email], phone: (900) 976- }, // end row { // this is the key to another row in the CF // now we have another infinite # of columns in this row username: ieure, email: [hidden email], phone: (888) 555-1212 age: 66, gender: undecided }, } -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Understanding-Indexes-tp6058238p6061959.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. You do not need secondary indexes to search on the RowKey. The Row Key is used by the partitioner to locate your data across the cluster. The Row Key is also used as the primary sort of the SSTables. Thus the row key is naturally indexed.
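A rough picture of what "naturally indexed" means, as a toy Python model. Nothing here is Cassandra's actual code: the function names, the row keys `user1`/`user2` and the node tokens are all hypothetical, and the rows are sorted by raw key for simplicity. The two ideas it illustrates are that a hash of the row key picks the node, and that a sorted structure finds the row by binary search.

```python
import bisect
import hashlib


def token(row_key):
    # Idea behind a random partitioner: hash the key to a position on the ring.
    digest = hashlib.md5(row_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % (2 ** 127)


def owner(row_key, node_tokens):
    # The first node whose token is >= the key's token owns the row,
    # wrapping around to the lowest token at the end of the ring.
    ordered = sorted(node_tokens)
    i = bisect.bisect_left(ordered, token(row_key))
    return ordered[i % len(ordered)]


def lookup(sorted_rows, row_key):
    # sorted_rows: list of (row_key, columns) kept in key order,
    # standing in for a sorted on-disk table plus its index.
    keys = [k for k, _ in sorted_rows]
    i = bisect.bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return sorted_rows[i][1]
    return None


rows = sorted([("user2", {"username": "ieure"}),
               ("user1", {"username": "phatduckk"})])
nodes = [0, 2 ** 127 // 3, 2 * 2 ** 127 // 3]
print(owner("user1", nodes) in nodes)
print(lookup(rows, "user2"))
```

No secondary index is involved anywhere: the row key alone is enough to find the node and then the row.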
Re: New Chain for : Does Cassandra use vector clocks
On Thu, Feb 24, 2011 at 3:56 PM, A J s5a...@gmail.com wrote: While we are at it, there's more to consider than just CAP in distributed :) http://voltdb.com/blog/clarifications-cap-theorem-and-data-related-errors
Re: Fill disks more than 50%
On Fri, Feb 25, 2011 at 7:38 AM, Terje Marthinussen tmarthinus...@gmail.com wrote: @Thibaut Britz Caveat:Using simple strategy. This works because cassandra scans data at startup and then serves what it finds. For a join for example you can rsync all the data from the node below/to the right of where the new node is joining. Then join without bootstrap then cleanup both nodes. (also you have to shutdown the first node so you do not have a lost write scenario in the time between rsync and new node startup) rsync all data from node to left/right.. Wouldn't that mean that you need 2x the data to recover...? Terje Terje, in your scenario, where you are never updating, running repair becomes less important. I have an alternative for you. I have a program I call the RescueRanger; we use it to range-scan all our data, find old entries, and delete them. However, if we set that program to read-only mode and tell it to read at CL.ALL, it becomes a program that read-repairs data! This is a tradeoff: range scanning through all your data is not fast, but it does not require the extra disk space. Kinda like merge sort vs. bubble sort.
Re: Storing photos, images, docs etc.
On Tue, Mar 1, 2011 at 1:43 PM, mcasandra mohitanch...@gmail.com wrote: Is it advisable or ok to store photos, images and docs in cassandra where you expect high volume of uploads and views? I was reading about facebook implementation of haystack to store the photos. They don't put anything in their mysql db. Since Cassandra is different from mysql I was wondering if it's ok or if there are going to be any issues. I tried searching online to read articles or papers on similar subject but couldn't find any where cassandra was being used to store docs/images etc. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Storing-photos-images-docs-etc-tp6078278p6078278.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. Googling the terms "cassandra large files" + feeling lucky http://www.google.com/search?q=cassandra+large+files&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a yields: http://wiki.apache.org/cassandra/FAQ#large_file_and_blob_storage This is also nearly a bi-monthly mailing list topic.
Re: Storing photos, images, docs etc.
On Thu, Mar 3, 2011 at 2:49 PM, mcasandra mohitanch...@gmail.com wrote: Has anyone heard about the Lustre distributed file system? I am wondering if it will work well where we keep the metadata in Cassandra and the images in Lustre. I looked at MogileFS but am not too sure about its support. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Storing-photos-images-docs-etc-tp6078278p6086135.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. Lustre and GlusterFS are cool, but this is apples and oranges. Those are both mountable file systems with POSIX support. That is very different from a key-value store.
Re: Poor performance on small data set
On Fri, Mar 11, 2011 at 11:44 AM, Peter Schuller peter.schul...@infidyne.com wrote: There are fewer than 1000 rows and it takes 75-100ms to get one row by id. With memcached it's 2ms. I don't know where the problem is. JVM? Cassandra? phpcassa? What can I do to detect where the problem is? I'm not familiar with the PHP client, but this sounds suspiciously like a Nagle + delayed ACK problem. The PHP client probably isn't setting the TCP_NODELAY flag (or the equivalent in Windows). Google for "nagle delayed ack" for details. -- / Peter Schuller Also, you will find that setting both rowsCached and keysCached is not effective. Choose one or the other. (That is not your problem, but an FYI.)
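For reference, this is what turning off Nagle looks like at the socket level. The sketch uses Python's standard `socket` module rather than PHP, only because it is short and easy to try. The helper name is made up, and whether phpcassa or its Thrift transport exposes an equivalent option is something to check in that client, not something this snippet shows.

```python
import socket


def open_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY = 1 sends small writes immediately instead of waiting
    # to coalesce them, which avoids the Nagle + delayed-ACK stall.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = open_low_latency_socket()
print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0)
sock.close()
```

A quick way to confirm the diagnosis is to time the same small request with and without the flag. Steady delays in the tens of milliseconds that vanish with TCP_NODELAY point at Nagle rather than at Cassandra or the JVM.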
Re: Is column update column-atomic or row atomic?
On Tue, Mar 15, 2011 at 5:46 PM, buddhasystem potek...@bnl.gov wrote: Sorry for the rather primitive question, but it's not clear to me if I need to fetch the whole row, add a column as a dictionary entry and re-insert it if I want to expand the row by one column. Help will be appreciated. -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-column-update-column-atomic-or-row-atomic-tp6174445p6174445.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. No. In Cassandra you do not need to read to write. You should try to avoid it if possible.
Re: Please help decipher /proc/cpuinfo for optimal Cassandra config
On Wed, Mar 16, 2011 at 9:58 PM, buddhasystem potek...@bnl.gov wrote: Dear All, this is from my new Cassandra server. It obviously uses hyperthreading, I just don't know how to translate this to concurrent readers and writers in cassandra.yaml -- can somebody take a look and tell me what number of cores I need to assume for concurrent_reads and concurrent_writes. Is it 24? Thanks! [cassandra@cassandra01 bin]$ cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz stepping : 2 cpu MHz : 1596.000 cache size : 12288 KB physical id : 0 siblings : 12 core id : 0 cpu cores : 6 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid bogomips : 5333.91 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz stepping : 2 cpu MHz : 1596.000 cache size : 12288 KB physical id : 0 siblings : 12 core id : 1 cpu cores : 6 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid bogomips : 5333.15 clflush size : 64 cache_alignment : 
64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz stepping : 2 cpu MHz : 1596.000 cache size : 12288 KB physical id : 0 siblings : 12 core id : 2 cpu cores : 6 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid bogomips : 5333.15 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz stepping : 2 cpu MHz : 1596.000 cache size : 12288 KB physical id : 0 siblings : 12 core id : 8 cpu cores : 6 apicid : 16 initial apicid : 16 fpu : yes fpu_exception : yes cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt aes lahf_lm arat tpr_shadow vnmi flexpriority ept vpid bogomips : 5333.15 clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: processor : 4 vendor_id : GenuineIntel cpu family : 6 model : 44 model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz stepping : 2 cpu MHz : 1596.000 cache size : 12288 KB physical id : 0 siblings : 12 core id : 9 cpu cores : 6 apicid : 18 initial apicid : 18 fpu : yes fpu_exception : yes 
cpuid level : 11 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36
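Reading that dump by eye is painful, so here is a small sketch that counts logical processors, sockets and physical cores from /proc/cpuinfo-style text. It relies only on the `processor`, `physical id` and `core id` fields visible in the dump above. The function name and the three-processor sample are made up for illustration.

```python
def summarize_cpuinfo(text):
    """Count logical CPUs, sockets and physical cores in cpuinfo text."""
    logical = 0
    sockets = set()
    cores = set()
    physical_id = None
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # blank line between processor blocks
        name, value = name.strip(), value.strip()
        if name == "processor":
            logical += 1
        elif name == "physical id":
            physical_id = value
            sockets.add(value)
        elif name == "core id":
            # A core is identified by (socket, core id); hyperthread
            # siblings repeat the same pair.
            cores.add((physical_id, value))
    return {"logical": logical, "sockets": len(sockets), "cores": len(cores)}


SAMPLE = """
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0
"""

print(summarize_cpuinfo(SAMPLE))
```

On the real machine, pass it the contents of /proc/cpuinfo, for example via `open("/proc/cpuinfo").read()`. In the dump above, `siblings : 12` next to `cpu cores : 6` already says each physical core shows up as two logical processors.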
Re: Replacing a dead seed
On Thu, Mar 17, 2011 at 9:09 AM, Jonathan Colby jonathan.co...@gmail.com wrote: Hi - If a seed crashes (i.e., suddenly unavailable due to HW problem), what is the best way to replace the seed in the cluster? I've read that you should not bootstrap a seed. Therefore I came up with this procedure, but it seems pretty complicated. Any better ideas? 1. Update the seed list on all nodes, taking out the dead node, and restart the nodes in the cluster so the new seed list is picked up. 2. Then bootstrap the new (replacement) node as a normal node (not yet as a seed). 3. When bootstrapping is done, make the new node a seed. 4. Update the seed list again, adding back the replacement seed (and rolling-restart the cluster as in step 1). That seems to me like a whole lot of work. Surely there is a better way? Jon It is true that seeds do not auto-bootstrap. But in this case it does not matter whether the other nodes believe this node is a seed. It only matters what the joining node is configured to believe. On the joining node, do not include its own hostname/IP in the seed list and it should auto-bootstrap normally.
Re: Optimizing a few nodes to handle all client connections?
On Fri, Mar 18, 2011 at 9:55 PM, Jason Harvey alie...@gmail.com wrote: Hola everyone, I have been considering making a few nodes only manage 1 token and entirely dedicating them to talking to clients. My reasoning behind this is I don't like the idea of a node having a dual-duty of handling data, and talking to all of the client stuff. Is there any merit to this thought? Cheers, Jason Technically possible but not recommended. Besides making this node a single point of failure, you assuredly add more latency to every request. Also, each request has memory overhead, and one node will carry the summed overhead of all requests, which is not scalable. This node can also become a bandwidth bottleneck. One of the reasons to choose Cassandra is that it does NOT have a master/queen node that all requests are proxied through.
Re: Working backwards from production to staging/dev
On Fri, Mar 25, 2011 at 2:11 PM, ian douglas i...@armorgames.com wrote: On 03/25/2011 10:12 AM, Jonathan Ellis wrote: On Fri, Mar 25, 2011 at 11:59 AM, ian douglasi...@armorgames.com wrote: (we're running v0.60) I don't know if you could hear that from where you are, but our whole office just yelled, WTF! :) Ah, that's what that noise was... And yeah, we know we're way behind. Our initial delay in upgrading was waiting for 0.7 to come out and then we learned we needed a whole new Thrift client for our PHP code base, and then we got busy on other things, but we're at a point where we have some time to take care of Cassandra and get it upgraded. Our planned path, now, is: (our nodes' tokens are numbered using the python code (0, 1/3 and 2/3 times 2^127), and called node 1 through 3, respectively; our RF is set to 2 right now) 1. remove node 1 from our software 2. bring node 1 offline after a flush/repair/cleanup 3. run a cleanup on node 2 and then on node 3 so they have a full copy of all data from the old node 1 and each other. 4. bring up a new Large 64-bit instance, install 0.6.12, assign a Token value of 0 (node 1), RF:2, on a new gossip ring, and copy all data from the 32-bit nodes 2 and 3 and run a repair/cleanup to remove any duplicated data 5. remove node 3 from our software 6. point our code to the new 64-bit node 1 7. bring node 3 offline after a flush/repair/cleanup so node 2 has the last fresh copy of everything 8. bring node 2 offline after a flush/repair/cleanup 9. bring up another Large instance, get a copy of all data from our old node 2, assign a Token value of (1/2 * 2^127), RF:2, on the new gossip ring, run a repair to remove duplicate data, and then a cleanup so it gets replicated data from the new node 1 10. add the new node 2 to our software 11. run a final cleanup on the new node 1 and then on node 2 to make sure all data is replicated evenly on both nodes ... 
at this point, we should have two 64-bit Large instances, with RF:2, on a new gossip ring, replacing three 32-bit systems, with minimal down time and no data loss (just a data delay between steps 6 and 10 above). Questions: 1. Does it appear that we've missed any steps, or doing something out of order? 2. Is the flush/repair/cleanup overkill when bringing the old nodes offline, or is that the correct sequence to follow? 3. Will the difference in compute units (lower on Large instances than Medium instances) make any noticeable difference, or will the fact that the machine is 64-bit handle things efficiently enough such that a Large instance works harder than a Medium instance? (never did figure out their how their compute units work) 4. Can we follow similar steps when we're ready to upgrade to 0.7x and have our new Thrift client for PHP all squared away? Thanks again for the help!!! If you have a node with an old column family you are not using anymore...Stop node...delete data...start node. Edward
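The token arithmetic in the quoted plan (0, 1/3 and 2/3 times 2^127 for three nodes; 0 and 1/2 times 2^127 for two) can be written as a one-liner. A sketch in Python; the function name is made up, and the 2^127 ring size is taken from the message above.

```python
def balanced_tokens(node_count, ring_size=2 ** 127):
    """Evenly spaced initial tokens for node_count nodes on the ring."""
    return [i * ring_size // node_count for i in range(node_count)]


for t in balanced_tokens(3):
    print(t)  # tokens for the old three-node ring
for t in balanced_tokens(2):
    print(t)  # tokens for the new two-node ring
```

Integer division keeps the results exact for these very large numbers, which floating-point multiplication by 1/3 would not.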
Re: Starter GUI Tool for Windows
I don't know. Apache web server is "a patchy" web server, but Crapssandra... there is just no way to put that in a good light. On Friday, March 25, 2011, Dario Bravo darbr...@gmail.com wrote: People: Crapssandra. I'm starting a Cassandra project and starting to learn about this beautiful Cassandra, so I thought that it would be nice to have a db gui tool under my current OS. It doesn't do anything other than showing some info about the server or the selected keyspace... but I hope it'll do many things such as manage keyspaces, column families, columns and super columns, show data contained on columns, allow to perform queries (get, set, mostly), etc. If anyone wishes to help in any way, please feel free to download the code and modify it. It's called Crapssandra because it started as a crappy simple code and its features are gonna be developed as I need them... so it will have crappy code, mostly. It's done using .net 3.5 and Thrift. The address to download it and its source code is: http://code.google.com/p/crapssandra/ Hope this helps someone, that the app grows as I wish, and to get some help from the community. Thanks! -- Darío Bravo
Re: Starter GUI Tool for Windows
On Sun, Mar 27, 2011 at 10:56 AM, Dario Bravo darbr...@gmail.com wrote: I'm adding new features today. You can now download it and will be able to view keyspaces info and column families. I will start to develop a feature to add column families to keyspaces... it will take some time, but you can play around with it (for almost a minute, before you get bored). 2011/3/26 Dario Bravo darbr...@gmail.com hehe, okay, maybe I'd chosen a bad name... can anybody think of a better one? If you check out the source, it can do a few new things, such as drop keyspaces (except system), and show info on selected nodes... Tomorrow I'll be adding a bunch of new features, I hope. 2011/3/26 Edward Capriolo edlinuxg...@gmail.com I don't know. Apache web server is "a patchy" web server, but Crapssandra... there is just no way to put that in a good light. On Friday, March 25, 2011, Dario Bravo darbr...@gmail.com wrote: People: Crapssandra. I'm starting a Cassandra project and starting to learn about this beautiful Cassandra, so I thought that it would be nice to have a db gui tool under my current OS. It doesn't do anything other than showing some info about the server or the selected keyspace... but I hope it'll do many things such as manage keyspaces, column families, columns and super columns, show data contained on columns, allow to perform queries (get, set, mostly), etc. If anyone wishes to help in any way, please feel free to download the code and modify it. It's called Crapssandra because it started as a crappy simple code and its features are gonna be developed as I need them... so it will have crappy code, mostly. It's done using .net 3.5 and Thrift. The address to download it and its source code is: http://code.google.com/p/crapssandra/ Hope this helps someone, that the app grows as I wish, and to get some help from the community. Thanks! -- Darío Bravo -- Darío Bravo -- Darío Bravo There is a client-dev list that is perfect for these threads.
Re: International language implementations
On Tue, Mar 29, 2011 at 5:54 PM, A J s5a...@gmail.com wrote: Example, taobao.com is a Chinese online bid site. All data is Chinese and they use Mongodb successfully. Are there similar installations of cassandra where data is non-latin? I know in theory, it should all work as cassandra has full utf-8 support. But unless there are real implementations, you cannot be sure of the issues related to storing, sorting, etc. Regards. On Tue, Mar 29, 2011 at 5:41 PM, Peter Schuller peter.schul...@infidyne.com wrote: Can someone list some of the current international language implementations of cassandra ? What is an international language implementation of Cassandra? -- / Peter Schuller Keyspace: Java String. ColumnFamily: Java String. Row key: byte[]. Column: byte[]. Value: byte[]. So you can encode/store any type of data you like. As for internationalization, I have not found any NadaSQL groups yet.
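Since row keys, column names and values are plain byte arrays, non-Latin text is a matter of how the client encodes it. A minimal sketch, assuming the client encodes everything as UTF-8 and that columns are compared byte-wise; the sample names are made up for illustration and nothing here talks to a real cluster:

```python
# Illustrative only: shows what "store any data as bytes" means for
# non-Latin text. The sample names below are hypothetical.

def to_bytes(text):
    """Encode text the way a client would before handing it to the store."""
    return text.encode("utf-8")

def from_bytes(raw):
    """Decode bytes read back from the store."""
    return raw.decode("utf-8")

names = ["上海", "apple", "北京", "Zürich"]
encoded = [to_bytes(name) for name in names]

# Byte-wise ordering of UTF-8 matches Unicode code point order, which is
# not the same thing as a locale-aware (dictionary) sort order.
byte_sorted = [from_bytes(raw) for raw in sorted(encoded)]
print(byte_sorted)
```

Storing and reading back is lossless; the part worth testing for a given language is sorting, because byte order is rarely the order a native reader expects.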
Re: How to determine if repair need to be run
On Wed, Mar 30, 2011 at 12:54 PM, Peter Schuller peter.schul...@infidyne.com wrote: Note this script doesn't work if your repair takes hours, and in the middle of the repair cassandra was restarted, nodetool will exit and the flagfile will be updated. Another case, if repair hangs, and a day later cassandra is restarted. This is why set -e is at the top and commented as important :) But it relies on 'nodetool repair' reliably exiting with non-zero exit status on failures. if nodetool returns an error this might work: nodetool -h localhost repair && touch /path/to/flagfile.tmp That's the equivalent, due to 'set -e'. -- / Peter Schuller I just wanted to chime in here and say some people NEVER run repair. In our particular case we remove inactive data older than a specific date. If we lost a tombstone and that data were to re-appear that would really not be the end of the world for us. Repair is really intensive since it involves a compaction and in 0.6.X was not optimal as it really increased on disk data. I have followed some threads and there are some conditions that I read repair can't handle. The question you have to ask yourself is how likely are they to occur and what they might mean in your use-case. These are not easy questions to answer.
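The "only touch the flag file if repair succeeded" idea from the quoted script can be sketched as follows. This is not the script from the thread; `run_then_touch` is a hypothetical helper, and, as Peter notes, it is only as reliable as the exit status of the command it runs:

```python
import subprocess
from pathlib import Path

def run_then_touch(command, flagfile):
    """Run command and update flagfile only when it exits with status 0.

    Returns True if the flag file was touched, False otherwise.
    """
    result = subprocess.run(command)
    if result.returncode != 0:
        # Leave the old timestamp in place so monitoring sees a stale repair.
        return False
    Path(flagfile).touch()
    return True

# Possible use, with the command and path taken from the thread:
#   run_then_touch(["nodetool", "-h", "localhost", "repair"],
#                  "/path/to/flagfile.tmp")
```

A monitoring check can then alert when the flag file's modification time is older than the interval in which repair is expected to complete.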
Re: Two column families or One super column family?
On Thu, Mar 31, 2011 at 3:52 AM, T Akhayo t.akh...@gmail.com wrote: Hi Aaron, Thank you for your reply, i appreciate the suggestions you made. Yesterday i managed to get everything (our main read) in one CF, with the use of a structure in a value like you suggested. Designing a new data model is different from what i'm used to, but if you keep in mind that you designing for performance instead of flexibility then everything gets a bit easier. Kind regards, T. Akhayo 2011/3/30 aaron morton aa...@thelastpickle.com I would go with the solution that means you only have to make one request to serve your reads, so consider the super CF approach. There are some downsides to super columns see http://wiki.apache.org/cassandra/CassandraLimitations and they tend to have a love-them-hate-them reputation. One thing to consider is that you do not need to model every attribute of your entity as a column in cassandra. Especially if you are always going to pull back all the attributes. So you could do your super CF approach with a standard CF, just pack the columns into some sort of structure such as JSON and store them as a blob. Or you can use a naming scheme in the column names with a standard CF, e.g. uuid1.text and uuid2.text Hope that helps. Aaron On 30 Mar 2011, at 01:05, T Akhayo wrote: Good afternoon, I'm making my data model from scratch for cassandra, this means i can tune and fine tune it for performance. At this time i'm having problems choosing between a 2 column families or 1 super column family. I will illustrate with a example. Sector, this defines a place, this is one or two properties. Entry, a entry that is bound to a sector, this is simply some text and a few properties. 
I can model this with a super column family:
sectors { // super column family
  sector1 {
    uid1 { text: a text, user: joop }
    uid2 { text: more text, user: piet }
  }
  sector2 {
    uid10 { text: even more text, user: marie }
  }
}
But i can also model this with 2 column families:
sectors { // column family
  sector1 { textid1: null, textid2: null }
  sector2 { textid4: null }
}
texts { // column family
  textid1 { text: a text, user: joop }
  textid2 { text: more text, user: piet }
}
With the super column family i can retrieve a list of texts for a specific sector with only 1 request to cassandra. With the 2 column families i need to send 2 requests to cassandra: 1. give me all textids from sector x. (returns x, y, z) 2. give me all texts that have id x, y, z. In my final application it is likely that there will be a bit more writes compared to reads. I was wondering what the best approach is when it comes to performance. I suspect that using super column families is slower compared to using column families, but is it still slower when using 2 column families and 2 requests to cassandra instead of 1 (with super column family)? Kind regards, T. Akhayo I decided to write this as a general guide to the topic of denormalizing things into multiple CF's or not. http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/whytf_would_i_need_with
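Aaron's suggestion of packing an entry's attributes into one value of a standard column family can be sketched with plain dictionaries. This is an in-memory stand-in, not client code: the nested dict plays the role of "row key -> column name -> value", and the helper names are invented for the example.

```python
import json

def pack_entry(text, user):
    """Serialize an entry's attributes into a single column value (a blob)."""
    return json.dumps({"text": text, "user": user}, sort_keys=True).encode("utf-8")

def unpack_entry(blob):
    """Turn a stored blob back into a dict of attributes."""
    return json.loads(blob.decode("utf-8"))

# Stand-in for one standard column family: one row per sector,
# one column per entry, the value being the packed entry.
sectors = {
    "sector1": {
        "uid1": pack_entry("a text", "joop"),
        "uid2": pack_entry("more text", "piet"),
    },
    "sector2": {
        "uid10": pack_entry("even more text", "marie"),
    },
}

def entries_for_sector(column_family, sector):
    """One row lookup returns every entry of the sector."""
    return {uid: unpack_entry(blob) for uid, blob in column_family[sector].items()}

print(entries_for_sector(sectors, "sector1"))
```

The trade-off is that a single attribute can no longer be updated on its own; the whole blob is rewritten, which fits best when all attributes are always read together.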
Re: Not able to set ZERO consistency level
On Thu, Mar 31, 2011 at 2:53 PM, Peter Schuller peter.schul...@infidyne.com wrote: Only the following Levels are provided, I am wondering if the ZERO consistency level is removed in Cassandra 0.7.X ? Yes, it's gone. If so, could you please explain why it was removed and what is the best option I have given my context. https://issues.apache.org/jira/browse/CASSANDRA-1607 Are you *sure* you want it? :) -- / Peter Schuller ANY would be the next step up. Beware though of the eventually consistent bogeyman!
Re: Node added, no performance boost -- are the tokens correct?
On Thu, Mar 31, 2011 at 6:15 PM, Eric Gilmore e...@datastax.com wrote: A script that I have says the following:
$ python ctokens.py
How many nodes are in your cluster? 2
node 0: 0
node 1: 85070591730234615865843651857942052864
The first token should be zero, for the reasons discussed here: http://www.datastax.com/dev/tutorials/getting_started_0_7/configuring#initial-token-values More details are available in http://www.datastax.com/docs/0.7/operations/clustering#adding-capacity The DS docs have some weak areas, but these two pages have been pretty well vetted over the past months :) On Thu, Mar 31, 2011 at 3:06 PM, buddhasystem potek...@bnl.gov wrote: I just configured a cluster of two nodes -- do these token values make sense? The reason I'm asking is that so far I don't see load balancing happening, judging from performance.
Address Status State Load Owns Token
170141183460469231731687303715884105728
130.199.185.194 Up Normal 153.52 GB 50.00% 85070591730234615865843651857942052864
130.199.185.193 Up Normal 199.82 GB 50.00% 170141183460469231731687303715884105728
-- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Node-added-no-performance-boost-are-the-tokens-correct-tp6228872p6228872.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. The first token does not really have to be zero. The tokens just have to be spread evenly across the token space.
Re: Ditching Cassandra
Gregori, Congrats on winning the FUD-liest post of the month award. Firstly, if you don't like updates, give up on computers and software. Especially give up on anything that has to do with nosql because it is fast evolving. If you think you have a problem with the cassandra api, then what you really have is a problem with the data model. You should have done more research nine months ago. I can not understand from your rant exactly what you think is better about the mongo api. I see the complaint about lots of code; I suggest books on design patterns. It is hardly the fault of cassandra that it works with so many languages and people create higher level clients and abstractions for it. I believe it is a testament to cassandra that many places that are historically non java shops can pick up ruby or php clients and dive in. Also I do not see exactly what is so hard about the api thrift generates. To me it looks like the memcache api except the value is a map. I do not see what needs to be wrapped around it to make it easier... Maybe a factory method to one-liner things? On Wednesday, March 30, 2011, Ashlee Saunders ashlee.saund...@aswebco.com.au wrote: Thanks for the feedback Gregori, We in Australia are only concerned with solutions as we are a solutions focused organization. With respect to your feedback, you and your team seem to have identified no solutions other than jumping ship. When we subscribed to the 50 or so emails per day, we wanted to contribute solutions to the Cassandra community rather than dwell on problems. I have enjoyed following the team on this project, and they have been very solutions focused. Please refrain from contributing negatively. Find solutions to the Cassandra project. To the rest, please keep up the great work. Ashlee Saunders On 31/03/2011, at 7:19 AM, Ed Anuff e...@anuff.com wrote: My concern when I see something like this is it might cause developers on the project to get worried and start to try to solve the wrong problems.
Cassandra is not going to be as easy as Mongo, certainly not any time soon. CQL won't do it, although it will help. This isn't a criticism of Cassandra or CQL though. Cassandra isn't here to compete with Mongo on ease of use, it's here to compete on scalability. Secondly, the client libraries are not a mess. Some might be, some are not - Hector, which is the one I contribute to, is pretty good. Client libraries aren't going away. People are still building client libraries on top of SQL four decades later, we just call them ORM or middleware. Cassandra's data model is by necessity somewhat complicated, and most of the client libraries are going to have to be more than wrappers around Thrift or easy ways to send CQL. There's where Hector is going, it has a lightweight JPA implementation and it's going to have a very robust implementation soon. Honestly, the only criticism by the OP that should be taken to heart is stability. Cassandra can be the hardest database in the world to use and still succeed, but it has to be rock solid at all levels of scale, and that has to be the focus in the near term. On Tue, Mar 29, 2011 at 5:11 PM, Gregori Schmidt grokd...@gmail.com wrote: hi, After using Cassandra during development for the past 8 months my team and I made the decision to switch from Cassandra to MongoDB this morning. I thought I'd share some thoughts on why we did this and where Cassandra might benefit from improvement.
Re: nodetool cfstathistogram error
On Thu, Mar 31, 2011 at 8:25 PM, mcasandra mohitanch...@gmail.com wrote: It looks like if I use system schema it fails. Is it because of LocalPartitioner? I ran with another keyspace and got the following output.
Offset SSTables Write Latency Read Latency Row Size Column Count
1 0 0 0 0 0
2 0 0 0 0 0
179 0 0 0 320 320
Can someone please help me understand the output in first 2 columns? Why are SSTables always 0? I am writing shell/awk scripts to parse this data and send it out to monitoring tool. So far I am planning to monitor output of netstat, tpstat and cfhistograms. Is there anything else I should monitor that might be helpful? The system schema does not work here, and it probably would not produce any interesting output if it did.
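For the parsing step mentioned above, here is a rough sketch that turns output shaped like the sample into rows a monitoring script can use. It assumes exactly the six-column layout quoted in the message; the field names and the `parse_cfhistograms` helper are invented for the example, and other versions may print a different layout.

```python
# Field names are our own labels for the six columns in the quoted output.
COLUMNS = ("offset", "sstables", "write_latency",
           "read_latency", "row_size", "column_count")

def parse_cfhistograms(text):
    """Return one dict per data row; header and blank lines are skipped."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows are exactly six non-negative integers.
        if len(fields) != len(COLUMNS) or not all(f.isdigit() for f in fields):
            continue
        rows.append(dict(zip(COLUMNS, (int(f) for f in fields))))
    return rows

sample = """Offset SSTables Write Latency Read Latency Row Size Column Count
1 0 0 0 0 0
2 0 0 0 0 0
179 0 0 0 320 320
"""

for row in parse_cfhistograms(sample):
    print(row)
```

Rows where every count is zero can usually be dropped before shipping the data to a monitoring tool, which keeps the payload small.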
Re: Node added, no performance boost -- are the tokens correct?
On Fri, Apr 1, 2011 at 1:15 PM, Peter Schuller peter.schul...@infidyne.com wrote: Now, I moved the tokens. I still observe that read latency deteriorated with 3 machines vs original one. Replication factor is 1, Cassandra version 0.7.2 (didn't have time to upgrade as I need results by this weekend). Read *latency* is fully expected to increase if you just add a node. *Throughput* should increase, unless you have a workload that manages to be more expensive on RPC than actual reads/writes. Latency would only be improved by additional nodes under some significant load. How are you benchmarking? Are you concurrently submitting requests to all nodes at the same time? Try using stress.py from the Cassandra tree as a comparison. If you're sending one request at a time, there is no expectation at all of a performance improvement - just a decrease in performance. -- / Peter Schuller To be clear on this issue: it does not matter where the tokens start, it only matters that they are equally spaced around the token space. So for a 4 node cluster your tokens should either be
1 * ((2^127) / 4) = 42535295865117307932921825928971026432
2 * ((2^127) / 4) = 85070591730234615865843651857942052864
3 * ((2^127) / 4) = 127605887595351923798765477786913079296
4 * ((2^127) / 4) = 170141183460469231731687303715884105728
or
0 * ((2^127) / 4) = 0
1 * ((2^127) / 4) = 42535295865117307932921825928971026432
2 * ((2^127) / 4) = 85070591730234615865843651857942052864
3 * ((2^127) / 4) = 127605887595351923798765477786913079296
If you move one you have to move the rest because the distance between 170141183460469231731687303715884105728 and 0 is 1
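The token arithmetic above can be reproduced with a few lines. This is a sketch, not the ctokens.py script mentioned earlier in the thread; it assumes a token space of 2^127 as used in the posts, and the function name and `first_index` parameter are made up for the example.

```python
RING_SIZE = 2 ** 127  # token space used in the posts above

def evenly_spaced_tokens(node_count, first_index=0):
    """Return node_count tokens spaced RING_SIZE / node_count apart.

    first_index=0 starts the ring at token 0; first_index=1 gives the
    other numbering shown above, which ends at 2^127.
    """
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count
            for i in range(first_index, first_index + node_count)]

for node, token in enumerate(evenly_spaced_tokens(4)):
    print("node %d: %d" % (node, token))
```

Integer division keeps the results exact for Python's arbitrary-precision integers; with node counts that do not divide 2^127 evenly (such as 3) the tokens are rounded down, which is close enough for balancing.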
Re: Bizarre side-effect of increasing read concurrency
On Fri, Apr 1, 2011 at 11:27 PM, Jason Harvey alie...@gmail.com wrote: On further analysis, it looks like this behavior occurs when a node is simply restarted. Is that normal behavior? If mark-and-sweep becomes less and less effective over time, does that suggest an issue with GC, or an issue with memory use? On Apr 1, 8:21 pm, Jason Harvey alie...@gmail.com wrote: After increasing read concurrency from 8 to 64, GC mark-and-sweep was suddenly able to reclaim much more memory than it previously did. Previously, mark-and-sweep would run around 5.5GB, and would cut heap usage to 4GB. Now, it still runs at 5.5GB, but it shrinks all the way down to 2GB used. This behavior was consistent in every machine I increased read-concurrent on. Any thoughts on why this behavior changed? No other diagnostics appeared to correlate to the concurrency change, besides thread count. Jason, First, you do not need to restart to adjust concurrent readers. It can be done from JMX without a restart. As for the memory, after you restart you may have drained your caches and memtables, which would explain why less memory is used. Java also enjoys using all the memory you allocate, and the garbage collector does not give it back unless it needs to. Edward
Re: Endless minor compactions after heavy inserts
On Sun, Apr 3, 2011 at 1:46 PM, Sheng Chen chensheng2...@gmail.com wrote: I think if i can keep a single sstable file in a proper size, the hot data/index files may be able to fit into memory at least in some occasions. In my use case, I want to use cassandra for storage of a large amount of log data. There will be multiple nodes, and each node has 10*2TB disks to hold as much data as possible, ideally 20TB (about 100 billion rows) in one node. Reading operations will be much less than writing. A reading latency within 1 second is acceptable. Is it possible? Do you have advice on this design? Thank you. Sheng 2011/4/3 aaron morton aa...@thelastpickle.com With only one data file your reads would use the least amount of IO to find the data. Most people have multiple nodes and probably fewer disks, so each node may have a TB or two of data. How much capacity do your 10 disks give ? Will you be running multiple nodes in production ? Aaron On 2 Apr 2011, at 12:45, Sheng Chen wrote: Thank you very much. The major compaction will merge everything into one big file., which would be very large. Is there any way to control the number or size of files created by major compaction? Or, is there a recommended number or size of files for cassandra to handle? Thanks. I see the trigger of my minor compaction is OperationsInMillions. It is a number of operations in total, which I thought was in a second. Cheers, Sheng 2011/4/1 aaron morton aa...@thelastpickle.com If you are doing some sort of bulk load you can disable minor compactions by setting the min_compaction_threshold and max_compaction_threshold to 0 . Then once your insert is complete run a major compaction via nodetool before turning the minor compaction back on. You can also reduce the compaction threads priority, see compaction_thread_priority in the yaml file. The memtable will be flushed when either the MB or ops throughput is triggered. 
If you are seeing a lot of memtables smaller than the MB threshold then the ops threshold has probably been triggered. Look for a log message at INFO level starting with "Enqueuing flush of Memtable" that will tell you how many bytes and ops the memtable had when it was flushed. Try increasing the ops threshold and see what happens. Your change in the compaction threshold may not have an effect because the compaction process was already running. AFAIK the best way to get the best out of your 10 disks will be to use a dedicated mirror for the commit log and a stripe set for the data. Hope that helps. Aaron On 1 Apr 2011, at 14:52, Sheng Chen wrote: I've got a single node of cassandra 0.7.4, and I used the java stress tool to insert about 100 million records. The inserts took about 6 hours (45k inserts/sec) but the following minor compactions last for 2 days and the pending compaction jobs are still increasing. From jconsole I can read the MemtableThroughputInMB=1499, MemtableOperationsInMillions=7.0 But in my data directory, I got hundreds of 438MB data files, which should be the cause of the minor compactions. I tried to set compaction threshold by nodetool, but it didn't seem to take effect (no change in pending compaction tasks). After restarting the node, my setting is lost. I want to distribute the read load in my disks (10 disks in xfs, LVM), so I don't want to do a major compaction. So, what can I do to keep the sstable file in a reasonable size, or to make the minor compactions faster? Thank you in advance. Sheng Consider implications of http://wiki.apache.org/cassandra/LargeDataSetConsiderations
Re: Embedding Cassandra in Java code w/o using ports
On Mon, Apr 4, 2011 at 8:29 AM, aaron morton aa...@thelastpickle.com wrote: I'm interested to know more about the problems using the CLI. Aaron. On 2 Apr 2011, at 15:07, Bob Futrelle wrote: Connecting via CLI to local host with a port number has never been successful for me in Snow Leopard. No amount of reading suggestions and varying the approach has worked. So I'm going to talk to Cassandra via its API, from Java. But I noticed that in some code samples that call the API from Java, ports are also in play. In using Derby in Java I've never had to designate any ports. Is such a strategy available with Cassandra? - Bob Futrelle Northeastern U. I realize you do not want to open ports at all. One thing I do is leverage the private loopback addresses that are on each computer: 127.0.0.1 and 127.0.0.2 through 127.255.255.254.
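To make the loopback idea concrete, here is a small sketch that lists addresses from the 127.0.0.0/8 block, one per local instance. It only enumerates addresses; the helper name is hypothetical, and on some operating systems addresses other than 127.0.0.1 may need to be configured on the loopback interface before anything can bind to them.

```python
import ipaddress
from itertools import islice

LOOPBACK_BLOCK = ipaddress.ip_network("127.0.0.0/8")

def loopback_addresses(count):
    """Return the first `count` usable loopback addresses as strings."""
    return [str(address) for address in islice(LOOPBACK_BLOCK.hosts(), count)]

# One address per local test instance, all unreachable from other machines.
for number, address in enumerate(loopback_addresses(3), start=1):
    print("instance %d listens on %s" % (number, address))
```

Each instance still uses ports, but because every address stays on the local machine nothing is exposed to the network.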
Re: selecting random columns ..
On Fri, Apr 8, 2011 at 4:48 AM, Sasha Dolgy sdo...@gmail.com wrote: hi all, is there a way to select random columns from a key? -- Sasha Dolgy sasha.do...@gmail.com getRangeSlice with random column start key.
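One reading of that suggestion: pick a random value in the column-name space and ask for a slice that starts there. The sketch below models this in memory with a sorted list standing in for a row's column names; it assumes names are fixed-width hex strings, and both helper names are invented for the example.

```python
import bisect
import random

def random_start(rng=random):
    """A random 8-character hex string to use as the slice start."""
    return format(rng.getrandbits(32), "08x")

def slice_from(sorted_columns, start, count):
    """Return up to `count` column names that sort at or after `start`."""
    index = bisect.bisect_left(sorted_columns, start)
    return sorted_columns[index:index + count]

columns = ["00000010", "40000000", "c0000000"]
print(slice_from(columns, random_start(), 2))
```

The selection is not uniform: a column that follows a large gap in the name space is picked more often, and a start past the last column returns nothing unless the caller retries or wraps around.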
Re: database design
On Wed, Apr 13, 2011 at 10:39 AM, Jean-Yves LEBLEU jleb...@gmail.com wrote: Hi all, Just some thoughts and a question I have about cassandra data modeling. If I understand well, cassandra is better at writing than at reading. So you have to think about your queries to design the cassandra schema. We are doing incremental design, and already have our system in production and we have to develop new queries. What do you usually do when you have new queries? Do you write a specific job to update data in the database to match the new query you are writing? Thanks for your help. Jean-Yves Good point. Generally you will need to write some type of range scanning/map reduce application to process and backfill your data.
Re: Quick Poll: Server names
On Tue, Jul 27, 2010 at 11:49 AM, uncle mantis uncleman...@gmail.com wrote: Ah S**T! The Pooh server is down again! =) What does one do if they run out of themed names? Regards, Michael On Tue, Jul 27, 2010 at 10:46 AM, Brett Thomas brettptho...@gmail.com wrote: I like names of colleges On Tue, Jul 27, 2010 at 11:40 AM, Dave Viner davevi...@pobox.com wrote: I've seen used several... names of children of employees of the company names of streets near office names of diseases (lead to very hard to spell names after a while, but was quite educational for most developers) names of characters from famous books (e.g., lord of the rings, asimov novels, etc) On Tue, Jul 27, 2010 at 7:54 AM, uncle mantis uncleman...@gmail.com wrote: I will be naming my servers after insect family names. What do you all use for yours? If this is something that is too off topic please contact a moderator. Regards, Michael I know this is a fun thread, and I hate being a Debbie Downer but... In my opinion, naming servers after anything other than their function is not a great idea. Let's look at some positives and negatives: System1: cassandra01, cassandra02, cassandra03 vs. System2: tom, dick, harry. Forward and reverse DNS: System1 is easy to manage; with the server number you can easily figure out an offset. System2 requires careful mapping and will be more error prone. The future: Way back when, a company I was at used Native American tribe names. Guess what happened. At about 20 nodes we ran out of common names like Cherokee, and we had servers named choctaw. These names become hard to spell and hard to say. Once you run out of Native American names and you start using 'country names', what is the point? It is not even a convention any more. Cassandra servers are named after Native Americans, or possibly food, or possibly a dog. Quick someone... fido just went down? What does fido do? Is it important? Is it in our web cluster or our cassandra cluster? Someone above mentioned Chevron1 till Chevron9.
Look, they ran out of unique names after the 5th server. So essentially 5 unique fun names then chevron6-1000. Why is chevron6-1000 better than cassandra6-1000 and is it any more fun? Reboots: Have you ever called a data center at 1AM for a server reboot? Picking a fancy, non phonetic name is a great way for a tired NOC operator to reboot the wrong one.
Re: how to recover cassandra data
On Mon, Aug 2, 2010 at 9:11 AM, john xie shanfengg...@gmail.com wrote: ReplicationFactor = 3. One day I stopped 192.168.1.147 and removed cassandra data by mistake. Can I recover 192.168.1.147's cassandra data by restarting cassandra?
<DataFileDirectories>
<DataFileDirectory>/data1/cassandra/</DataFileDirectory>
<DataFileDirectory>/data2/cassandra/</DataFileDirectory>
<DataFileDirectory>/data3/cassandra/</DataFileDirectory>
</DataFileDirectories>
/data3 is mounted from /dev/sdd. I removed /data3 and formatted /dev/sdd.
Address Status Load Range Ring
135438270110006521520577363629178401179
192.168.1.148 Up 50.38 GB 5243502939295338512484974245382898 |--|
192.168.1.145 Up 48.38 GB 63161078970569359253391371326773726097 | |
192.168.1.147 ? 23.5 GB 79546317728707787532885001681404757282 | |
192.168.1.146 Up 26.34 GB 135438270110006521520577363629178401179 |--|
Since you have a replication factor of three, if you bring a new node in through auto-bootstrap, data will migrate back to it, because the cluster still has two copies. Nothing is lost.
Re: unable to start cassandra
On Tue, Aug 3, 2010 at 10:47 AM, Maciej Lisowski m.lisow...@powerprice.pl wrote: Hi all, I’m new here and new with Cassandra and I’ve got problem to run it (v. 0.6.4) with jdk1.6.0_21. When I type “cassandra” to run it I get error: ERROR 16:23:53,803 Uncaught exception in thread Thread[ROW-MUTATION-STAGE:5,5,main] java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NullPointerException at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.RuntimeException: java.lang.NullPointerException at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) ... 2 more Caused by: java.lang.NullPointerException at org.apache.cassandra.db.Table$TableMetadata.getColumnFamilyId(Table.java:131) at org.apache.cassandra.db.Table.getColumnFamilyId(Table.java:364) at org.apache.cassandra.db.commitlog.CommitLog$4.runMayThrow(CommitLog.java:256) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 6 more I was looking for info what could happen but I didn’t find… can someone help me with this? If I have to send something more (configuration or whatever) let me know Maciek Something similar happened to me when upgrading from 6.1 - 6.2. 
Even though the on-disk format of the SSTables is the same, sometimes the wire-format serialization of Future Tasks changes. If that is the case, it means that the upgrade can NOT be done with a rolling restart. I am not sure this is the case here but that might help. Edward
Re: unable to start cassandra
On Tue, Aug 3, 2010 at 11:44 AM, Edward Capriolo edlinuxg...@gmail.com wrote: On Tue, Aug 3, 2010 at 10:47 AM, Maciej Lisowski m.lisow...@powerprice.pl wrote: Hi all, I’m new here and new with Cassandra and I’ve got problem to run it (v. 0.6.4) with jdk1.6.0_21. When I type “cassandra” to run it I get error: ERROR 16:23:53,803 Uncaught exception in thread Thread[ROW-MUTATION-STAGE:5,5,main] java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.lang.NullPointerException at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.RuntimeException: java.lang.NullPointerException at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) ... 2 more Caused by: java.lang.NullPointerException at org.apache.cassandra.db.Table$TableMetadata.getColumnFamilyId(Table.java:131) at org.apache.cassandra.db.Table.getColumnFamilyId(Table.java:364) at org.apache.cassandra.db.commitlog.CommitLog$4.runMayThrow(CommitLog.java:256) at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30) ... 6 more I was looking for info what could happen but I didn’t find… can someone help me with this? If I have to send something more (configuration or whatever) let me know Maciek Something similar happened to me when upgrading from 6.1 - 6.2. 
Even though the on-disk format of the SSTables is the same, sometimes the wire-format serialization of Future Tasks changes. If that is the case, it means that the upgrade can NOT be done with a rolling restart. I am not sure this is the case here but that might help. Edward Sorry, I misread. It does not look like you are doing an upgrade. Disregard my comments.
Growing commit log directory.
I have a 16 node 6.3 cluster and two nodes from my cluster are giving me major headaches.

10.71.71.56  Up    58.19 GB  10827166220211678382926910108067277      |   ^
10.71.71.61  Down  67.77 GB  123739042516704895804863493611552076888  v   |
10.71.71.66  Up    43.51 GB  127605887595351923798765477786913079296  |   ^
10.71.71.59  Down  90.22 GB  139206422831293007780471430312996086499  v   |
10.71.71.65  Up    22.97 GB  148873535527910577765226390751398592512  |   ^

The symptoms I am seeing are that nodes 61 and 59 have huge 6 GB+ commit log directories. They keep growing, along with memory usage; eventually the logs start showing GCInspection errors and then the nodes go OOM:

INFO 14:20:01,296 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1281378001296.log
INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896
INFO 14:20:03,201 GC for ParNew: 443 ms, 45124504 reclaimed leaving 8137412920 used; max is 9773776896
INFO 14:20:04,314 GC for ParNew: 438 ms, 54158832 reclaimed leaving 8310139720 used; max is 9773776896
INFO 14:20:05,547 GC for ParNew: 409 ms, 56888760 reclaimed leaving 8480136592 used; max is 9773776896
INFO 14:20:06,900 GC for ParNew: 441 ms, 58149704 reclaimed leaving 8648872520 used; max is 9773776896
INFO 14:20:08,904 GC for ParNew: 462 ms, 59185992 reclaimed leaving 8816581312 used; max is 9773776896
INFO 14:20:09,973 GC for ParNew: 460 ms, 57403840 reclaimed leaving 8986063136 used; max is 9773776896
INFO 14:20:11,976 GC for ParNew: 447 ms, 59814376 reclaimed leaving 9153134392 used; max is 9773776896
INFO 14:20:13,150 GC for ParNew: 441 ms, 61879728 reclaimed leaving 9318140296 used; max is 9773776896
java.lang.OutOfMemoryError: Java heap space
Dumping heap to java_pid10913.hprof ...
INFO 14:22:30,620 InetAddress /10.71.71.66 is now dead.
INFO 14:22:30,621 InetAddress /10.71.71.65 is now dead.
INFO 14:22:30,621 GC for ConcurrentMarkSweep: 44862 ms, 261200 reclaimed leaving 9334753480 used; max is 9773776896
INFO 14:22:30,621 InetAddress /10.71.71.64 is now dead.
Heap dump file created [12730501093 bytes in 253.445 secs]
ERROR 14:28:08,945 Uncaught exception in thread Thread[Thread-2288,5,main]
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
ERROR 14:28:08,948 Uncaught exception in thread Thread[Thread-2281,5,main]
java.lang.OutOfMemoryError: Java heap space
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:71)
INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880 reclaimed leaving 9335215296 used; max is 9773776896

Does anyone have any ideas what is going on?
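For anyone who wants to watch this kind of heap trend without reading the log by eye, here is a minimal sketch that pulls heap occupancy out of the GC lines. It assumes the exact line format shown in the log excerpt of this message; the names `GC_LINE` and `heap_usage` are made up for illustration and are not part of Cassandra.

```python
import re

# Matches GC lines of the form seen in the log excerpt, for example:
# INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896
GC_LINE = re.compile(
    r"GC for (?P<collector>\w+): (?P<ms>\d+) ms, (?P<reclaimed>\d+) reclaimed "
    r"leaving (?P<used>\d+) used; max is (?P<max>\d+)"
)


def heap_usage(lines):
    """Yield (collector, pause_ms, used_fraction) for every GC line found."""
    for line in lines:
        match = GC_LINE.search(line)
        if match:
            used_fraction = int(match["used"]) / int(match["max"])
            yield match["collector"], int(match["ms"]), used_fraction


# Three lines copied from the log excerpt in this thread.
sample = [
    "INFO 14:20:01,296 Creating new commitlog segment /var/lib/cassandra/commitlog/CommitLog-1281378001296.log",
    "INFO 14:20:02,199 GC for ParNew: 327 ms, 57545496 reclaimed leaving 7955651792 used; max is 9773776896",
    "INFO 14:28:09,017 GC for ConcurrentMarkSweep: 33737 ms, 85880 reclaimed leaving 9335215296 used; max is 9773776896",
]

results = list(heap_usage(sample))
for collector, pause_ms, used_fraction in results:
    print(f"{collector:<20} {pause_ms:>6} ms  heap {used_fraction:.1%}")
```

A long ConcurrentMarkSweep pause that reclaims almost nothing while occupancy stays near the maximum is the pattern visible in the log just before the OOM.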
Re: Growing commit log directory.
On Mon, Aug 9, 2010 at 8:20 PM, Jonathan Ellis jbel...@gmail.com wrote:
What does tpstats or other JMX monitoring of the o.a.c.concurrent stages show?

On Mon, Aug 9, 2010 at 4:50 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
I have a 16 node 6.3 cluster and two nodes from my cluster are giving me major headaches. [...]

-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of Riptano, the source for professional Cassandra support http://riptano.com

Hey guys, thanks for the help. I had lowered my Xmx from 12GB to 10GB because I saw:

[r...@cdbsd09 ~]# /usr/local/cassandra/bin/nodetool --host 10.71.71.59 --port 8585 info
123739042516704895804863493611552076888
Load : 68.91 GB
Generation No: 1281407425
Uptime (seconds) : 1459
Heap Memory (MB) : 6476.70 / 12261.00

This was happening:

[r...@cdbsd11 ~]# /usr/local/cassandra/bin/nodetool --host cdbsd09.hadoop.pvt --port 8585 tpstats
Pool Name                    Active   Pending   Completed
STREAM-STAGE                      0         0           0
RESPONSE-STAGE                    0         0       16478
ROW-READ-STAGE                   64      4014       18190
LB-OPERATIONS                     0         0           0
MESSAGE-DESERIALIZER-POOL         0         0       60290
GMFD                              0         0         385
LB-TARGET                         0         0           0
CONSISTENCY-MANAGER               0         0        7526
ROW-MUTATION-STAGE               64       908      182612
MESSAGE-STREAMING-POOL            0         0           0
LOAD-BALANCER-STAGE               0         0           0
FLUSH-SORTER-POOL                 0         0           0
MEMTABLE-POST-FLUSHER             0         0           8
FLUSH-WRITER-POOL                 0         0           8
AE-SERVICE-STAGE                  0         0           0
HINTED-HANDOFF-POOL               1         9           6

After raising the level I realized I was maxing out the heap.
The other nodes are running fine with Xmx at 9GB, but I guess these nodes cannot. Thanks again. Edward
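Spotting the backed-up stages in tpstats output can be scripted. This is a rough sketch, assuming the whitespace-separated "Pool Name / Active / Pending / Completed" layout printed by nodetool in this thread; the function name `backed_up_stages` is hypothetical and the sample is a subset of the rows reported here.

```python
def backed_up_stages(tpstats_text):
    """Return {pool_name: pending} for every pool that has pending tasks."""
    pending = {}
    for line in tpstats_text.splitlines():
        parts = line.split()
        # Data rows end in three integer columns: Active, Pending, Completed.
        # The header row and blank lines fail this check and are skipped.
        if len(parts) < 4 or not all(p.isdigit() for p in parts[-3:]):
            continue
        name = " ".join(parts[:-3])
        pending_count = int(parts[-2])
        if pending_count > 0:
            pending[name] = pending_count
    return pending


# A subset of the rows from the tpstats output in this thread.
sample = """\
Pool Name                    Active   Pending   Completed
RESPONSE-STAGE                    0         0       16478
ROW-READ-STAGE                   64      4014       18190
ROW-MUTATION-STAGE               64       908      182612
HINTED-HANDOFF-POOL               1         9           6
"""

stages = backed_up_stages(sample)
for name, count in sorted(stages.items(), key=lambda item: -item[1]):
    print(f"{name}: {count} pending")
```

On the numbers above, the read and mutation stages are the ones with large pending queues, which matches the heap filling up on those two nodes.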
a plea not to remove rowsize warning
Hello all, I recently posted to the list about a situation where two nodes of my 16-node cluster were garbage collecting heavily and then going OOM. After moving my Xmx from 9 GB to 11 GB I was able to see the pattern: memory would sawtooth around 4 GB before shooting up like a rocket.

After digging around I noticed that the JMX row stats on that node reported a max compacted row size of 128 MB, while the mean row size was 2000 bytes. At the time I was unaware of the setting that warns of large rows during compaction. Unfortunately the default for this setting, 512 MB, is too high for me, since I have been using the row cache. When something gets this key, extreme memory pressure is put on the system to move it in and out of the row cache.

I was able to lower this setting to 10 MB and got nice warnings printed showing me the offending keys. I do not know how this data got there. My guess is that null is getting encoded into this key, and the key becomes the graveyard for bad data.

Until the row cache can handle large keys better, I find it imperative to keep the setting and the warnings, because writing a program to range scan all the data to find one big key is very intensive.
Re: indexing rows ordered by int
On Sunday, August 15, 2010, S Ahmed sahmed1...@gmail.com wrote:
For CFs that I need to perform range scans on, I create separate CFs that have custom ordering. Say a CF holds comments on a story (like comments on a reddit or digg story post). So if I need to order comments by votes, it seems I have to re-index every time someone votes on a comment (or batch it every x minutes). Right now I think I have to pull all the comments into memory, then sort by votes, then re-write the index. Are there any best practices for this type of index? It seems that most stories will have few comments (1-100).

If you are only looking to order comments on a given article by vote, this seems like something you would want to store with the article and/or calculate on the fly. Unless you are looking for a feature like "show highest rated comment across all articles", I do not understand why you would need a separate CF. Does my suggestion make sense? If not, can you share your storage-conf.xml?
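To illustrate the "calculate on the fly" option: with 1-100 comments per story, sorting in the client after reading the story's comments is cheap. A minimal sketch, assuming the comments have already been fetched into plain dictionaries; the field names `id` and `votes` and the function `top_comments` are invented for the example and do not come from any Cassandra client API.

```python
from operator import itemgetter


def top_comments(comments, limit=None):
    """Return the comments ordered by vote count, highest first."""
    # sorted() is stable, so comments with equal votes keep their stored order.
    ranked = sorted(comments, key=itemgetter("votes"), reverse=True)
    return ranked if limit is None else ranked[:limit]


# Hypothetical comments for a single story, as read from one row.
story_comments = [
    {"id": "c1", "votes": 3},
    {"id": "c2", "votes": 17},
    {"id": "c3", "votes": 3},
    {"id": "c4", "votes": 9},
]

ranked_ids = [c["id"] for c in top_comments(story_comments)]
top_two_ids = [c["id"] for c in top_comments(story_comments, limit=2)]
print(ranked_ids)
print(top_two_ids)
```

Nothing has to be re-indexed when a vote arrives; only the vote count stored with the comment changes, and the order is recomputed on the next read.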
Hive Storage Handler for Cassandra
Hello,

Anyone interested in doing map/reduce on Cassandra data should take a look at the Cassandra Storage Handler for Hive. Storage handlers give Hive the ability to work with data outside HDFS in a more natural way. Support is now in place for reading and writing to/from Standard Column Families (no super column support yet). While this allows users to use an SQL-like language on their Cassandra data, it does NOT do things like push down of a where clause into sub-second queries.

https://issues.apache.org/jira/browse/HIVE-1434

For those looking to try this out with minimal effort, I have a tar bundle with cassandra, hive, and hadoop here:

http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/test_drive_hive_cassandra_integration

::Warning:: The bundle is a pre-release build of Hive with cassandra support. Treat it as such.

Enjoy, Edward
Re: cache sizes using percentages
On Tue, Aug 17, 2010 at 1:55 PM, Artie Copeland yeslinux@gmail.com wrote:
if i set a key cache size of 100% the way i understand how that works is:
- the cache is not write through, but read through
- a key gets added to the cache on the first read if not already available
- the size of the cache will always increase for every item read. so if you have 100mil items your key cache will grow to 100mil
Here are my questions: if that is the case then what happens if you only have enough mem to store 10mil items in your key cache? do you lose the other 90%? how is it determined what is removed? will the server keep adding til it gets OOM? if you add a row cache as well how does that affect your percentage? is there a priority between the caches? or are they independent so both will try to be satisfied which would result in an OOM?
thanx, artie -- http://yeslinux.org http://yestech.org

Artie,

In my experience, what ends up happening is this. You start your server and all is well; your cache builds up and the cache hit rate keeps climbing! Of course, so does memory usage. At some point you start reaching your Xmx, and Java keeps trying to garbage collect, often. A couple of things can happen, all of them bad. One is just hitting an OOM. Another is that the JVM spends so much time garbage collecting and so little time processing that it throws another exception (which might be a subtype of OOM).

do you lose the other 90%? how is it determined what is removed?
Items are removed once the cache is full; actual memory usage is NOT taken into account.

if you add a row cache as well how does that affect your percentage?
Mutually exclusive.

is there a priority between the caches?
No
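To make the "full means item count, not bytes" point concrete, here is a toy read-through cache that is capped by number of entries only. It is a sketch of the general behavior described above, not Cassandra's cache code; the least-recently-used eviction order is an assumption made for the example, and `CountBoundedCache` and `loader` are invented names.

```python
from collections import OrderedDict


class CountBoundedCache:
    """Read-through cache limited by entry count, never by memory used."""

    def __init__(self, capacity, loader):
        self.capacity = capacity      # maximum number of entries
        self.loader = loader          # called on a miss to fetch the value
        self._items = OrderedDict()   # oldest entry first

    def get(self, key):
        if key in self._items:
            # Hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Miss: read through, then store the value on this first read.
        value = self.loader(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict the least recently used entry (assumed policy).
            # The size in bytes of the evicted or inserted value is ignored.
            self._items.popitem(last=False)
        return value

    def keys(self):
        return list(self._items)


cache = CountBoundedCache(capacity=2, loader=str.upper)
for key in ["a", "b", "a", "c"]:
    cache.get(key)
print(cache.keys())
```

Because the limit is a count, two entries of 128 MB each fit just as readily as two entries of 2000 bytes each, which is why a cache sized by item count can still exhaust the heap.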