Re: Capacity problem with a lot of writes?
Making compaction parallel isn't a priority because the problem is almost always the opposite: how do we spread it out over a longer period of time instead of sharp spikes of activity that hurt read/write latency. I'd be very surprised if latency would be acceptable if you did have parallel compaction. In other words, your real problem is that you need more capacity for your workload.

On Thu, Nov 25, 2010 at 5:18 PM, Carlos Alvarez cbalva...@gmail.com wrote:
>> When you say that it grows constantly, does that mean up to 30 or even farther?
>
> My total data size is 2 TB.
>
> Actually, I never see the count stable. When it reached 30, I thought I was hitting the default upper limit for a compaction and that something had gone wrong, so I went back to 1 GB memtables (I also saw higher read latencies).
>
> Well, I think you are right: I am CPU bound on compaction, because during compactions I see a single JVM thread that is almost always in the running state, and the disk is not used beyond 50%.
>
>> (A nice future improvement would be to allow for concurrent compaction so that Cassandra would be able to utilize multiple CPU cores, which may mitigate this if you have left-over CPU. However, this is not currently supported.)
>
> Yes, sure. I'd be happy to test, but I don't dare to alter the code :-)
>
> I think that a partial solution would help: if the compaction compacted to 'n' different new sstables (not one), the implementation would be easier. I mean, the compaction would compact, for instance, 10 sstables to 2 (2 being the level of parallelism). In this way, the sstable count would eventually remain stable (although higher).
>
> What do you think?
>
> Carlos.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Re: Capacity problem with a lot of writes?
> Making compaction parallel isn't a priority because the problem is almost always the opposite: how do we spread it out over a longer period of time instead of sharp spikes of activity that hurt read/write latency. I'd be very surprised if latency would be acceptable if you did have parallel compaction. In other words, your real problem is that you need more capacity for your workload.

Do you expect this to be true even with the I/O situation improved (i.e., under conditions where the additional I/O is not a problem)? It seems counter-intuitive to me that single-core compaction would make a huge impact on latency when compaction is CPU bound on an 8+ core system under moderate load (even taking into account cache coherency/NUMA etc.).

--
/ Peter Schuller
Re: Capacity problem with a lot of writes?
On Fri, Nov 26, 2010 at 10:49 AM, Peter Schuller peter.schul...@infidyne.com wrote:
>> Making compaction parallel isn't a priority because the problem is almost always the opposite: how do we spread it out over a longer period of time instead of sharp spikes of activity that hurt read/write latency. I'd be very surprised if latency would be acceptable if you did have parallel compaction. In other words, your real problem is that you need more capacity for your workload.
>
> Do you expect this to be true even with the I/O situation improved (i.e., under conditions where the additional I/O is not a problem)? It seems counter-intuitive to me that single-core compaction would make a huge impact on latency when compaction is CPU bound on an 8+ core system under moderate load (even taking into account cache coherency/NUMA etc.).
>
> --
> / Peter Schuller

Carlos, I wanted to mention a specific technique I used to solve a situation I ran into. We had a large influx of data that pushed the limits of our current hardware; as stated above, the true answer was more hardware. However, we ran into a situation where a single node failed several large compactions. After failing 2 or 3 big compactions, we ended up with ~1000 SSTables for one column family. This turned into a chicken-and-egg situation: reads were slow because there were many sstables and extra data like tombstones, but compaction was brutally slow because of the read/write traffic.

My solution was to create a side-by-side install on the same box, using different data directories and different ports (/var/lib/cassandra/compact, 9168, etc.). I moved the data to the new install, started it up, and then ran nodetool compact on the new instance. This node was seeing no read or write traffic. I was surprised to see the machine at 400%/1600% CPU with not much io-wait. Compacting 600 GB of small SSTables took about 4 days. (However, when sstables are larger, I have compacted 400 GB in 4 hours on the same hardware.) Afterwards I moved the data files back in place and started the node back into the cluster.

I have lived on both sides of the fence, wanting long slow compactions at some times and breakneck-fast ones at others. I believe there is room for other compaction models. I am interested in systems that can optimize the case with multiple data directories, for example. It seems from my experiment that a major compaction cannot fully utilize the hardware in specific conditions. Knowing which model to use where, and how to automatically select the optimal strategy, are interesting questions though.
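For a rough sense of scale on those two data points, dividing data size by elapsed time gives the effective compaction throughput in each case. This is only a back-of-envelope sketch, and it assumes the quoted sizes and durations are directly comparable:

small_sstable_rate = 600 * 1024 / (4 * 24 * 3600.0)   # 600 GB in ~4 days  -> ~1.8 MB/s
large_sstable_rate = 400 * 1024 / (4 * 3600.0)        # 400 GB in ~4 hours -> ~28 MB/s
print(f"{small_sstable_rate:.1f} MB/s vs {large_sstable_rate:.1f} MB/s, "
      f"roughly {large_sstable_rate / small_sstable_rate:.0f}x faster with larger sstables")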
Re: Capacity problem with a lot of writes?
On Fri, Nov 26, 2010 at 1:34 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> I believe there is room for other compaction models. I am interested in systems that can optimize the case with multiple data directories, for example. It seems from my experiment that a major compaction cannot fully utilize the hardware in specific conditions. Knowing which model to use where, and how to automatically select the optimal strategy, are interesting questions though.

Thank you for sharing your technique. I also think that a different model of compaction could be useful, especially in situations where the normal, nice compaction (the one that leaves room for reads and writes to occur) is not enough to recover, or when you have small windows of low activity with a lot of unused resources in the cluster.

Carlos.
Re: Capacity problem with a lot of writes?
You want your memtables as large as is reasonable, but not too large. Sounds like yours are too large. As a first step, I would strongly recommend upgrading to 0.6.8 and reducing the compaction priority: http://www.riptano.com/blog/cassandra-annotated-changelog-063

On Thu, Nov 25, 2010 at 12:32 PM, Carlos Alvarez cbalva...@gmail.com wrote:
> Hello All.
>
> I am facing a (capacity?) problem in my eight-node cluster running 0.6.2 patched with CASSANDRA-1014 and CASSANDRA-699.
>
> I have 200 writes/second at peak (on each node, taking replication into account), with a row size of 35 KB. I configured the memtable size to 1 GB, the biggest size my heap seems to tolerate. With this configuration, the cluster is writing an sstable every 5 minutes. When I decrease the memtable size I run into a minor compaction storm. However, the 1 GB memtable forces me to have a huge heap and leaves me living on the edge of memory.
>
> The questions are:
>
> - Do I need more power in order to write less than 1 GB every five minutes?
> - Has anyone experience with heaps of more than 8 GB?
> - Are the standard minor compaction thresholds usually enough for high loads? Or is that something I am supposed to tune?
>
> Thanks!
>
> Carlos.
>
> --
> Perhaps there was an error in the spelling. Or in the articulation of the Sacred Name.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Re: Capacity problem with a lot of writes?
> When I decrease the memtable size I run into a minor compaction storm.

By "minor compaction storm", do you mean that compaction (whenever it runs) degrades your node performance too much and smaller sstables mean this happens more often, or do you mean that compaction is not keeping up with writes at all over time?

--
/ Peter Schuller
Re: Capacity problem with a lot of writes?
Thank you very much, Jonathan and Peter. I will upgrade.

However, the point has to do with the issue Peter mentions. With smaller memtables I see that minor compaction is unable to keep up with the writes: the number of sstables grows constantly during my peak hours. With 400 MB memtables the cluster is always compacting and the sstable count grows constantly. I don't see that the cluster is I/O bound even with compaction (disk utilization is below 60% during compactions), but I think that a large number of sstables affects my read latency. Now I have 5-7 sstables during peak hours, and when I tried the smaller memtables I saw 30 sstables (and then I got scared and rolled back the change).

Carlos.
Re: Capacity problem with a lot of writes?
> However, the point has to do with the issue Peter mentions. With smaller memtables I see that minor compaction is unable to keep up with the writes: the number of sstables grows constantly during my peak hours. With 400 MB memtables the cluster is always compacting and the sstable count grows constantly. I don't see that the cluster is I/O bound even with compaction (disk utilization is below 60% during compactions), but I think that a large number of sstables affects my read latency. Now I have 5-7 sstables during peak hours, and when I tried the smaller memtables I saw 30 sstables (and then I got scared and rolled back the change).

When you say that it grows constantly, does that mean up to 30 or even farther? Because it is expected that smaller sstables will give you higher sstable count spikes. Only one compaction runs at a time, and larger compactions will take some amount of time to run. Given that amount of time, with smaller memtable sizes the number of sstables that have time to be flushed in the meantime is higher. So a higher sstable count spike is not necessarily an indication that you're not keeping up, unless it just grows and grows indefinitely. But you're right that the sstable count will affect the seek overhead of reads.

What is your total data size? (That affects the maximum work necessary for the biggest compaction jobs.)

With respect to your disk utilization: I assume your ~35 KB rows are made up of several smaller columns? (If not, I would expect compaction to be disk bound rather than CPU bound, at least assuming you're not running with a very fast I/O device.)

In any case: if indeed you are in a position where the sstable counts are not just the result of large compactions allowing several memtable flushes to happen in the meantime, and you are in fact not keeping up with writes due to being CPU bound, then yeah - basically that means you need more capacity to handle the load (unless you can re-model the data to be less CPU heavy in Cassandra, but that seems like the wrong way to go in most cases). Given your 200 writes/second, assuming they are full rows of 35 KB, that implies about 7 MB/second of writes. Given small enough column values it seems plausible that you'd be CPU bound on compaction (hand-wavy gut feeling on my part).

(A nice future improvement would be to allow for concurrent compaction so that Cassandra would be able to utilize multiple CPU cores, which may mitigate this if you have left-over CPU. However, this is not currently supported.)

--
/ Peter Schuller
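To put rough numbers on Peter's point: the write rate works out to about 6.8 MB/second, and the sstable-count spike follows from how many flushes fit inside one long compaction. A small sketch of that arithmetic; the write rate and the 5-minute flush interval for 1 GB memtables come from the thread, while the 60-minute compaction duration is purely an assumed figure for illustration:

# How many sstables pile up while a single long compaction runs.
WRITE_RATE_MB_S = 200 * 35 / 1024.0     # ~6.8 MB/s, i.e. Peter's "about 7 MB/second"
FLUSH_INTERVAL_MIN_AT_1GB = 5.0         # Carlos: an sstable every ~5 minutes with 1 GB memtables
COMPACTION_DURATION_MIN = 60.0          # assumed length of one large compaction (illustration only)

print(f"incoming writes: ~{WRITE_RATE_MB_S:.1f} MB/s")
for memtable_mb in (1024, 400):
    flush_interval = FLUSH_INTERVAL_MIN_AT_1GB * memtable_mb / 1024.0  # scales with memtable size
    piled_up = COMPACTION_DURATION_MIN / flush_interval
    print(f"{memtable_mb:>4} MB memtables: flush every {flush_interval:.1f} min, "
          f"~{piled_up:.0f} sstables accumulate while that compaction runs")

With those (assumed) numbers the 1 GB configuration accumulates roughly a dozen sstables during one big compaction, while 400 MB memtables accumulate around thirty, which is in the ballpark of what Carlos reports without necessarily meaning compaction has fallen behind.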
Re: Capacity problem with a lot of writes?
> When you say that it grows constantly, does that mean up to 30 or even farther?

My total data size is 2 TB.

Actually, I never see the count stable. When it reached 30, I thought I was hitting the default upper limit for a compaction and that something had gone wrong, so I went back to 1 GB memtables (I also saw higher read latencies).

Well, I think you are right: I am CPU bound on compaction, because during compactions I see a single JVM thread that is almost always in the running state, and the disk is not used beyond 50%.

> (A nice future improvement would be to allow for concurrent compaction so that Cassandra would be able to utilize multiple CPU cores, which may mitigate this if you have left-over CPU. However, this is not currently supported.)

Yes, sure. I'd be happy to test, but I don't dare to alter the code :-)

I think that a partial solution would help: if the compaction compacted to 'n' different new sstables (not one), the implementation would be easier. I mean, the compaction would compact, for instance, 10 sstables to 2 (2 being the level of parallelism). In this way, the sstable count would eventually remain stable (although higher).

What do you think?

Carlos.
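A toy sketch of the idea Carlos describes: split the merged key space into n ranges, so one compaction of many small inputs produces n output sstables and each range can be merged independently. Plain sorted key/value lists stand in for sstables here; the range-partitioning approach and all names are illustrative assumptions, not anything the Cassandra code does today.

import heapq
from concurrent.futures import ThreadPoolExecutor

def merge_partition(sstables, lo, hi):
    """Merge the slice of every input 'sstable' whose keys fall in [lo, hi)."""
    slices = [[(k, v) for k, v in t if lo <= k < hi] for t in sstables]
    return list(heapq.merge(*slices, key=lambda kv: kv[0]))

def compact_to_n(sstables, n):
    """Compact many inputs into n output 'sstables', one key range each."""
    keys = sorted({k for t in sstables for k, _ in t})
    bounds = [keys[i * len(keys) // n] for i in range(n)] + [keys[-1] + 1]
    with ThreadPoolExecutor(max_workers=n) as pool:
        parts = [pool.submit(merge_partition, sstables, bounds[i], bounds[i + 1])
                 for i in range(n)]
        return [p.result() for p in parts]

# ten small "sstables" (sorted key/value lists) compacted into two outputs
inputs = [[(k, f"v{k}") for k in range(i, 100, 10)] for i in range(10)]
out_a, out_b = compact_to_n(inputs, 2)
print(len(out_a), len(out_b))   # 50 50: every key lands in exactly one output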
Re: Capacity problem with a lot of writes?
>> When you say that it grows constantly, does that mean up to 30 or even farther?
>
> My total data size is 2 TB.

For the entire cluster? (I realized I was being ambiguous; I meant per node.)

> Actually, I never see the count stable. When it reached 30, I thought I was hitting the default upper limit for a compaction and that something had gone wrong, so I went back to 1 GB memtables (I also saw higher read latencies).

Assuming 2 TB is for the entire cluster and you have 8 nodes, that's 250 GB per node. So yeah, that will take a reasonable amount of time to compact when larger or major compactions take place.

> Well, I think you are right: I am CPU bound on compaction, because during compactions I see a single JVM thread that is almost always in the running state, and the disk is not used beyond 50%.

Sounds like it.

> I think that a partial solution would help: if the compaction compacted to 'n' different new sstables (not one), the implementation would be easier. I mean, the compaction would compact, for instance, 10 sstables to 2 (2 being the level of parallelism). In this way, the sstable count would eventually remain stable (although higher).
>
> What do you think?

Yes, that's been my thinking. Even for a single huge column family, it doesn't really matter that individual large compactions take longer, as long as smaller compactions keep happening concurrently.

It's worth noting, though, that merely supporting concurrent compactions in the sense of spawning more than one thread to do it is only a partial solution; with sufficiently large column families you end up having to keep several (not just a couple of) compactions going in order to ensure that large and medium compactions run while at the same time allowing the smallest compactions to take place. For one thing, you probably don't want to commit too many cores to compaction, and in addition you can run into issues with I/O becoming seek bound if you have too many processes trying to stream data concurrently.

I think the optimal solution would involve something along the lines of having a fixed, configurable compaction concurrency (number of threads), and then having those threads interleave an arbitrary number of compactions (do 1 gig, switch to the other compaction, do 1 gig, etc.). That way you have direct control over actual CPU and I/O concurrency, and can independently make sure that compactions happen in proper prioritized order, constantly keeping sstable counts within appropriate bounds.

(For comparison, the problem is very similar to PostgreSQL and the auto-vacuuming of databases containing tables with extreme differences in size. Allowing multiple vacuum processes helps solve this, but it has the equivalent scalability issues if you have a particularly problematic distribution of table sizes and workload.)

--
/ Peter Schuller
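A small sketch of the scheduling scheme Peter outlines: a fixed number of worker threads pull pending compactions from a priority queue, do roughly one chunk of work, and requeue the task so small compactions are never starved behind a huge one. Everything here (the names, the 1 GB chunk, the two-thread concurrency) is illustrative; it is not how Cassandra's compaction code is actually structured.

import queue
import threading

CHUNK_GB = 1  # Peter's example granularity: "do 1 gig, switch to the other compaction"

class CompactionTask:
    """A pending compaction, processed one chunk at a time so tasks can interleave."""
    def __init__(self, name, total_gb, priority):
        self.name, self.remaining_gb, self.priority = name, total_gb, priority

    def do_chunk(self):
        # Stand-in for merging roughly CHUNK_GB worth of sstable data.
        self.remaining_gb -= min(CHUNK_GB, self.remaining_gb)

def worker(q):
    while True:
        priority, tiebreak, task = q.get()
        if task is None:                       # sentinel: no more work for this thread
            return
        task.do_chunk()
        print(f"{threading.current_thread().name}: {task.name}, {task.remaining_gb} GB left")
        if task.remaining_gb > 0:
            q.put((priority, tiebreak, task))  # requeue so other compactions get a turn

NUM_COMPACTOR_THREADS = 2                      # the fixed, configurable concurrency
q = queue.PriorityQueue()
q.put((1, 0, CompactionTask("huge-cf major", total_gb=20, priority=1)))
q.put((0, 1, CompactionTask("small-cf minor", total_gb=3, priority=0)))
for i in range(NUM_COMPACTOR_THREADS):
    q.put((float("inf"), i, None))             # sentinels sort last, drain after real work

threads = [threading.Thread(target=worker, args=(q,), name=f"compactor-{i}")
           for i in range(NUM_COMPACTOR_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()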
Re: Capacity problem with a lot of writes?
On Thu, Nov 25, 2010 at 9:09 PM, Peter Schuller peter.schul...@infidyne.com wrote:
>> My total data size is 2 TB.
>
> For the entire cluster? (I realized I was being ambiguous; I meant per node.)

Yes, for the entire cluster.

Regarding the multithreaded compaction, I'll look into the code and do my best. I think that compaction is a limiting factor in my cluster; with faster compaction I guess I could support more operations on the same hardware (having fewer sstables and needing less memory to hold huge memtables).

Carlos.