Re: Incremental repairs leading to unrepaired data
Can't say I have too many ideas. If load is low during the repair, it shouldn't be happening. Your disks aren't overutilised, correct? No other processes writing loads of data to them?
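A quick way to confirm that while a repair is running is to watch disk utilisation and compaction activity on each node (generic commands, assuming the sysstat package is installed so iostat is available):

  iostat -x 5                # watch %util and await on the data disks during anticompaction
  nodetool compactionstats   # pending and active compactions/anticompactions on the node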
Re: Incremental repairs leading to unrepaired data
That is not happening anymore since I am repairing a keyspace with much less data (the other one is still there in write-only mode).

The command I am using is the most boring one (I even shed the -pr option so as to keep anticompactions to a minimum):

  nodetool -h localhost repair

It's executed sequentially on each node (no overlapping; the next node waits for the previous one to complete).

Regards,
Stefano Ortolani

On Mon, Oct 31, 2016 at 11:18 PM, kurt Greaves wrote:
> Blowing out to 1k SSTables seems a bit full on. What args are you passing to repair?
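In practice, "executed sequentially on each node" amounts to something like the following (an illustrative sketch only; node1, node2 and node3 stand in for the actual hosts):

  for node in node1 node2 node3; do
    ssh "$node" nodetool repair    # blocks until the repair on this node has completed
  done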
Re: Incremental repairs leading to unrepaired data
Blowing out to 1k SSTables seems a bit full on. What args are you passing to repair?

Kurt Greaves
k...@instaclustr.com
www.instaclustr.com

On 31 October 2016 at 09:49, Stefano Ortolani wrote:
> I've collected some more data-points, and I still see dropped mutations with compaction_throughput_mb_per_sec set to 8.
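For context, the common variants of the repair invocation in 3.0 look like this (generic examples, not necessarily what was run here):

  nodetool repair             # incremental repair of all ranges owned by the node (the 3.0 default)
  nodetool repair -pr         # limit the repair to the node's primary ranges
  nodetool repair --full      # full, non-incremental repair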
Re: Incremental repairs leading to unrepaired data
I've collected some more data-points, and I still see dropped mutations with compaction_throughput_mb_per_sec set to 8.

The only notable thing regarding the current setup is that I have another keyspace (not being repaired though) with really wide rows (100MB per partition), but that shouldn't have any impact in theory. Nodes do not seem that overloaded either, and I don't see any GC spikes while those mutations are dropped :/

Hitting a dead end here; any idea where to look next?

Regards,
Stefano

On Wed, Aug 10, 2016 at 12:41 PM, Stefano Ortolani wrote:
> That's what I was thinking. Maybe GC pressure?
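The dropped-mutation counters being discussed are exposed per node by stock nodetool; a generic way to keep an eye on them while the repair runs is:

  nodetool tpstats    # the "Dropped" section lists MUTATION messages dropped since startup
  nodetool gcstats    # rough per-node GC statistics, to correlate drops with GC pauses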
Re: Incremental repairs leading to unrepaired data
That's what I was thinking. Maybe GC pressure?

Some more details: during anticompaction I have some CFs exploding to 1K SSTables (back down to ~200 upon completion). HW specs should be quite good (12 cores / 32 GB RAM) but, I admit, still relying on spinning disks, with ~150GB per node. Current version is 3.0.8.

On Wed, Aug 10, 2016 at 12:36 PM, Paulo Motta wrote:
> That's pretty low already, but perhaps you should lower it to see if it improves the dropped mutations during anti-compaction (even if it increases repair time); otherwise the problem might be somewhere else.
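The per-table SSTable explosion described here can be tracked from the command line; a generic check (the keyspace and table names are placeholders) is:

  nodetool cfstats my_keyspace.my_table | grep "SSTable count"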
Re: Incremental repairs leading to unrepaired data
That's pretty low already, but perhaps you should lower it to see if it improves the dropped mutations during anti-compaction (even if it increases repair time); otherwise the problem might be somewhere else. Generally dropped mutations are a signal of cluster overload, so if there's nothing else wrong perhaps you need to increase your capacity. What version are you on?

2016-08-10 8:21 GMT-03:00 Stefano Ortolani:
> Not yet. Right now I have it set at 16. Would halving it more or less double the repair time?
Re: Incremental repairs leading to unrepaired data
Not yet. Right now I have it set at 16. Would halving it more or less double the repair time?

On Tue, Aug 9, 2016 at 7:58 PM, Paulo Motta wrote:
> Anticompaction throttling can be done by setting the usual compaction_throughput_mb_per_sec knob in cassandra.yaml or via nodetool setcompactionthroughput.
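For what it's worth, the throttle can be inspected and changed on a live node without a restart (the value 8 below is simply the halved figure under discussion):

  nodetool getcompactionthroughput    # prints the current throttle in MB/s
  nodetool setcompactionthroughput 8  # drop from 16 to 8 on the fly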
Re: Incremental repairs leading to unrepaired data
Anticompaction throttling can be done by setting the usual compaction_throughput_mb_per_sec knob in cassandra.yaml or via nodetool setcompactionthroughput. Did you try lowering that and checking if that improves the dropped mutations?

2016-08-09 13:32 GMT-03:00 Stefano Ortolani:
> Hi all,
>
> I am running incremental repairs on a weekly basis (can't do it every day as one single run takes 36 hours), and every time I have at least one node dropping mutations as part of the process (almost always during the anticompaction phase). Ironically this leads to a system where repairing makes some data consistent at the cost of making other data inconsistent.
>
> Does anybody know why this is happening?
>
> My feeling is that this might be caused by anticompacting column families with really wide rows and many SSTables. If that is the case, is there any way I can throttle that?
>
> Thanks!
> Stefano
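Concretely, the two knobs mentioned above are applied like this (the 8 MB/s figure is only an illustration, not a recommendation):

  # on a live node; takes effect immediately and reverts on restart
  nodetool setcompactionthroughput 8

  # persistent setting in cassandra.yaml, picked up on restart
  compaction_throughput_mb_per_sec: 8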