Re: repair performance
I would zero in on network throughput, especially inter-rack trunks.

Sent from my mobile
Daemeon Reiydelle
skype daemeon.c.m.reiydelle
USA 415.501.0198

On Mar 17, 2017 2:07 PM, "Roland Otta" wrote:
> [...]
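A quick way to sanity-check the raw network path before blaming Cassandra is an iperf3 run between two nodes (a sketch, assuming iperf3 is installed on both hosts; the hostname is a placeholder):

```shell
# On the receiving node: start an iperf3 server (listens on TCP 5201 by default)
iperf3 -s

# On the sending node: run a 30-second throughput test against the peer
# ("node2.example.com" is a placeholder for the other node's address)
iperf3 -c node2.example.com -t 30
```

If iperf3 reports multi-Gbit/s while repair streaming crawls at ~50 Mbit/s, the bottleneck is more likely in Cassandra's throttles or session concurrency than in the trunks.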
Re: repair performance
Good point! I did not (so far). I will do that - especially because I often see all compaction threads being used during repair (according to compactionstats).

Thank you also for your link recommendations. I will go through them.

On Sat, 2017-03-18 at 16:54, Thakrar, Jayesh wrote:
> [...]
Re: repair performance
You changed compaction_throughput_mb_per_sec, but did you also increase concurrent_compactors?

In reference to the reaper and some other info I received on the user forum to my question on "nodetool repair", here are some useful links/slides:

https://www.datastax.com/dev/blog/repair-in-cassandra
https://www.pythian.com/blog/effective-anti-entropy-repair-cassandra/
http://www.slideshare.net/DataStax/real-world-tales-of-repair-alexander-dejanovski-the-last-pickle-cassandra-summit-2016
http://www.slideshare.net/DataStax/real-world-repairs-vinay-chella-netflix-cassandra-summit-2016

From: Roland Otta <roland.o...@willhaben.at>
Date: Friday, March 17, 2017 at 5:47 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: repair performance

> [...]
Re: repair performance
Did not recognize that so far. Thank you for the hint. I will definitely give it a try.

On Fri, 2017-03-17 at 22:32 +0100, benjamin roth wrote:
> [...]
Re: repair performance
The fork from thelastpickle is (compatible with 3.0+). I'd recommend giving it a try over pure nodetool.

2017-03-17 22:30 GMT+01:00 Roland Otta:
> [...]
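For what it's worth, reaper's slicing can also be approximated by hand with sub-range repairs: split the full Murmur3 token ring into slices and repair each with `nodetool repair -st/-et`. A minimal sketch (the keyspace name and slice count are made-up examples; note that sub-range repairs run as full, not incremental, repairs):

```python
# Split the Murmur3Partitioner token range into equal slices and print one
# "nodetool repair" command per slice. Many small sub-range repairs keep each
# Merkle-tree/streaming session short and easy to retry after a timeout.

MIN_TOKEN = -2**63         # Murmur3Partitioner minimum token
MAX_TOKEN = 2**63 - 1      # Murmur3Partitioner maximum token

def token_slices(num_slices):
    """Yield (start, end) token pairs that cover the full ring contiguously."""
    width = (MAX_TOKEN - MIN_TOKEN) // num_slices
    start = MIN_TOKEN
    for i in range(num_slices):
        # Last slice absorbs the rounding remainder so the ring is fully covered.
        end = MAX_TOKEN if i == num_slices - 1 else start + width
        yield (start, end)
        start = end

if __name__ == "__main__":
    for st, et in token_slices(100):
        # -local restricts the repair to the local datacenter, as in the thread;
        # "mykeyspace" is a placeholder.
        print(f"nodetool repair -local -st {st} -et {et} mykeyspace")
```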
Re: repair performance
... maybe I should just try increasing the job threads with --job-threads. Shame on me.

On Fri, 2017-03-17 at 21:30, Roland Otta wrote:
> [...]
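For reference, the flag goes on the repair invocation; 4 is just an example value (and, if I recall correctly, nodetool caps job threads at 4):

```shell
# Repair only the local datacenter, processing up to 4 column families in
# parallel instead of the default 1.
nodetool repair -local -j 4
```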
Re: repair performance
Forgot to mention the version we are using: we are on 3.0.7, so I guess we should have incremental repairs by default. It also prints incremental: true when starting a repair:

INFO [Thread-7281] 2017-03-17 09:40:32,059 RepairRunnable.java:125 - Starting repair command #7, repairing keyspace xxx with repair options (parallelism: parallel, primary range: false, incremental: true, job threads: 1, ColumnFamilies: [], dataCenters: [ProdDC2], hosts: [], # of ranges: 1758)

3.0.7 is also the reason why we are not using reaper ... as far as I could figure out, it's not compatible with 3.0+.

On Fri, 2017-03-17 at 22:13 +0100, benjamin roth wrote:
> [...]
Re: repair performance
It depends a lot ...

- Repairs can be very slow, yes! (And unreliable, due to timeouts, outages, whatever.)
- You can use incremental repairs to speed things up for regular repairs.
- You can use "reaper" to schedule repairs and run them sliced, automated, failsafe.

The time repairs actually take may vary a lot depending on how much data has to be streamed and how inconsistent your cluster is.

50 Mbit/s is really a bit low! The actual performance depends on many factors, like your CPU, RAM, HDD/SSD, concurrency settings, and the load on the "old nodes" of the cluster. This is quite an individual problem you have to track down individually.

2017-03-17 22:07 GMT+01:00 Roland Otta:
> Hello,
>
> We are quite inexperienced with Cassandra at the moment and are playing
> around with a new cluster we built to get familiar with Cassandra and
> its possibilities.
>
> While getting familiar with that topic, we recognized that repairs in
> our cluster take a long time. To give an idea of our current setup,
> here are some numbers:
>
> Our cluster currently consists of 4 nodes (replication factor 3).
> These nodes are all on dedicated physical hardware in our own
> datacenter. All of the nodes have:
>
> 32 cores @ 2.9 GHz
> 64 GB RAM
> 2 SSDs (RAID 0), 900 GB each, for data
> 1 separate HDD for OS + commitlogs
>
> Current dataset:
> approx. 530 GB per node
> 21 tables (the biggest one has more than 200 GB per node)
>
> I already tried setting compaction throughput + streaming throughput to
> unlimited for testing purposes ... but that did not change anything.
>
> When checking system resources, I cannot see any bottleneck (CPUs are
> pretty idle and we have no iowaits).
>
> When issuing a repair via "nodetool repair -local" on a node, the
> repair takes longer than a day. Is this normal, or could we normally
> expect a faster repair?
>
> I also recognized that initializing new nodes in the datacenter was
> really slow (approx. 50 Mbit/s). Here, too, I expected much better
> performance - could those 2 problems be somehow related?
>
> br//
> roland
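Since the slow ~50 Mbit/s bootstrap points at streaming rather than compaction, the streaming throttles are worth double-checking at runtime as well (a sketch; 0 means unthrottled, use with care):

```shell
# Show the current streaming cap (in megabits/s)
nodetool getstreamthroughput

# Remove the cap for intra-datacenter streaming ...
nodetool setstreamthroughput 0

# ... and for cross-datacenter streaming
nodetool setinterdcstreamthroughput 0
```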