Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-21 Thread Paul Pollack
So I got to the bottom of this -- it turns out it's not an issue with
Cassandra at all. When these instances were originally set up we mounted
2TB drives at /dev/xvdc and persisted them to /etc/fstab, but at some point
someone unmounted those and replaced them with 4TB drives at /dev/xvdf
without updating fstab. So what essentially happened is that I brought a
node back into the cluster with a blank data drive and started a repair,
which I'm guessing then went and streamed in all the data that simply
wasn't there. I've killed the repair and am going to replace that node.
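
For posterity, the mismatch is easy to spot once you know to look for it --
something along these lines (our data directory is the stock
/var/lib/cassandra/data; adjust for your own layout):

# Which block device actually backs the data directory right now?
findmnt --target /var/lib/cassandra/data

# Which device does fstab still think should be mounted there?
grep xvd /etc/fstab

# Attached devices, their sizes, and where (if anywhere) they're mounted --
# in our case the 2TB /dev/xvdc in fstab vs. the 4TB /dev/xvdf actually attached
lsblk -o NAME,SIZE,MOUNTPOINT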

On Thu, Sep 21, 2017 at 7:58 AM, Paul Pollack 
wrote:

> Thanks for the suggestions guys.
>
> Nicolas, I just checked nodetool listsnapshots and it doesn't seem like
> those are causing the increase:
>
> Snapshot Details:
> Snapshot name                              Keyspace name  Column family name           True size  Size on disk
> 1479343904106-statistic_segment_timeline   klaviyo        statistic_segment_timeline   91.73 MiB  91.73 MiB
> 1479343904516-statistic_segment_timeline   klaviyo        statistic_segment_timeline   69.42 MiB  69.42 MiB
> 1479343904607-statistic_segment_timeline   klaviyo        statistic_segment_timeline   69.43 MiB  69.43 MiB
>
> Total TrueDiskSpaceUsed: 91.77 MiB
>
> Kurt, we definitely do have a large backlog of compactions, but I would
> expect only the currently running compactions to take up to 2x extra space,
> and for that space to be freed once they complete -- is that an inaccurate
> picture of how compaction actually works? When the disk was almost full at
> 2TB I grew the EBS volume to 3TB, and it's now using 2.6TB, so I think it's
> only a matter of hours before it fills the rest of the volume. The largest
> files on disk are *-big-Data.db files. Is there anything else I can check
> that might indicate whether or not the repair is really the root cause of
> this issue?
>
> Thanks,
> Paul
>
> On Thu, Sep 21, 2017 at 4:02 AM, Nicolas Guyomar <
> nicolas.guyo...@gmail.com> wrote:
>
>> Hi Paul,
>>
>> This might be a long shot, but some repairs can fail to clear their
>> snapshot (not sure if that's still the case with C* 3.7, but I had the
>> problem on the 2.X branch).
>> What does nodetool listsnapshots indicate?
>>
>> On 21 September 2017 at 05:49, kurt greaves  wrote:
>>
>>> Repair does overstream by design, so if that node is inconsistent you'd
>>> expect a bit of an increase. If you've got a backlog of compactions, that's
>>> probably due to the repair and likely the cause of the increase. If you're
>>> really worried you can do a rolling restart to stop the repair; otherwise
>>> maybe try increasing compaction throughput.
>>>
>>
>>
>


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-21 Thread Paul Pollack
Thanks for the suggestions guys.

Nicolas, I just checked nodetool listsnapshots and it doesn't seem like
those are causing the increase:

Snapshot Details:
Snapshot name                              Keyspace name  Column family name           True size  Size on disk
1479343904106-statistic_segment_timeline   klaviyo        statistic_segment_timeline   91.73 MiB  91.73 MiB
1479343904516-statistic_segment_timeline   klaviyo        statistic_segment_timeline   69.42 MiB  69.42 MiB
1479343904607-statistic_segment_timeline   klaviyo        statistic_segment_timeline   69.43 MiB  69.43 MiB

Total TrueDiskSpaceUsed: 91.77 MiB

Kurt, we definitely do have a large backlog of compactions, but I would
expect only the currently running compactions to take up to 2x extra space,
and for that space to be freed once they complete -- is that an inaccurate
picture of how compaction actually works? When the disk was almost full at
2TB I grew the EBS volume to 3TB, and it's now using 2.6TB, so I think it's
only a matter of hours before it fills the rest of the volume. The largest
files on disk are *-big-Data.db files. Is there anything else I can check
that might indicate whether or not the repair is really the root cause of
this issue?
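
For completeness, here's roughly what I'm watching while this grows (the
path below is the stock /var/lib/cassandra/data and klaviyo is just our
keyspace, so adjust for your own setup):

# Are repair streams still flowing into this node?
nodetool netstats

# How big is the compaction backlog, and how far along is the big one?
nodetool compactionstats -H

# Which tables are actually growing on disk?
du -sh /var/lib/cassandra/data/klaviyo/* | sort -h | tail -20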

Thanks,
Paul

On Thu, Sep 21, 2017 at 4:02 AM, Nicolas Guyomar 
wrote:

> Hi Paul,
>
> This might be a long shot, but some repairs can fail to clear their
> snapshot (not sure if that's still the case with C* 3.7, but I had the
> problem on the 2.X branch).
> What does nodetool listsnapshots indicate?
>
> On 21 September 2017 at 05:49, kurt greaves  wrote:
>
>> Repair does overstream by design, so if that node is inconsistent you'd
>> expect a bit of an increase. If you've got a backlog of compactions, that's
>> probably due to the repair and likely the cause of the increase. If you're
>> really worried you can do a rolling restart to stop the repair; otherwise
>> maybe try increasing compaction throughput.
>>
>
>


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-21 Thread Nicolas Guyomar
Hi Paul,

This might be a long shot, but some repairs can fail to clear their
snapshot (not sure if that's still the case with C* 3.7, but I had the
problem on the 2.X branch).
What does nodetool listsnapshots indicate?
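
Something along these lines, for example (the snapshot tag below is made up
purely for illustration -- use whatever listsnapshots actually reports on
your node):

# List snapshots and how much space they still pin on disk
nodetool listsnapshots

# Drop a leftover repair snapshot by its tag (example tag only)
nodetool clearsnapshot -t 1505968128000-example_table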

On 21 September 2017 at 05:49, kurt greaves  wrote:

> Repair does overstream by design, so if that node is inconsistent you'd
> expect a bit of an increase. If you've got a backlog of compactions, that's
> probably due to the repair and likely the cause of the increase. If you're
> really worried you can do a rolling restart to stop the repair; otherwise
> maybe try increasing compaction throughput.
>


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-20 Thread kurt greaves
Repair does overstream by design, so if that node is inconsistent you'd
expect a bit of an increase. If you've got a backlog of compactions, that's
probably due to the repair and likely the cause of the increase. If you're
really worried you can do a rolling restart to stop the repair; otherwise
maybe try increasing compaction throughput.
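
For example, something like this (the 64 MB/s below is an arbitrary example
value, not a recommendation -- 0 removes the throttle entirely at the cost
of more I/O pressure):

# Check the current throttle (the default is 16 MB/s)
nodetool getcompactionthroughput

# Raise it so the backlog can drain faster
nodetool setcompactionthroughput 64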


Re: Drastic increase in disk usage after starting repair on 3.7

2017-09-20 Thread Paul Pollack
Just a quick additional note -- we have checked, and this is the only node
in the cluster exhibiting this behavior; disk usage is steady on all the
others. CPU load on the repairing node is slightly higher, but nothing
significant.
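
For what it's worth, the comparison is just the obvious one -- something
along these lines, assuming the stock /var/lib/cassandra/data data directory:

# Per-node data size as the cluster sees it -- the Load column should be
# roughly even across nodes with balanced ownership
nodetool status

# Actual usage and remaining headroom on the data volume
df -h /var/lib/cassandra/data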

On Wed, Sep 20, 2017 at 9:08 PM, Paul Pollack 
wrote:

> Hi,
>
> I'm running a repair on a node in my 3.7 cluster and today got alerted on
> disk space usage. We keep the data and commit log directories on separate
> EBS volumes. The data volume is 2TB. The node went down due to EBS failure
> on the commit log drive. I stopped the instance and was later told by AWS
> support that the drive had recovered. I started the node back up and saw
> that it couldn't replay commit logs due to corrupted data, so I cleared the
> commit logs and then it started up again just fine. I'm not worried about
> anything there that wasn't flushed; I can replay that. I was unfortunately
> just outside the hinted handoff window, so I decided to run a repair.
>
> Roughly 24 hours after I started the repair is when I got the alert on
> disk space. I checked and saw that right before I started the repair the
> node was using almost 1TB of space, which is right where all the nodes sit,
> and over the course of 24 hours free space had dropped to about 200GB.
>
> My gut reaction was that the repair must have caused this increase, but
> I'm not convinced since the disk usage doubled and continues to grow. I
> figured we would see at most an increase of 2x the size of an SSTable
> undergoing compaction, unless there's more to the disk usage profile of a
> node during repair. We use SizeTieredCompactionStrategy on all the tables
> in this keyspace.
>
> Running nodetool compactionstats shows that there are a higher than usual
> number of pending compactions (currently 20), and there's been a large one
> of 292.82GB moving slowly.
>
> Is it plausible that the repair is the cause of this sudden increase in
> disk space usage? Are there any other things I can check that might provide
> insight into what happened?
>
> Thanks,
> Paul
>
>
>


Drastic increase in disk usage after starting repair on 3.7

2017-09-20 Thread Paul Pollack
Hi,

I'm running a repair on a node in my 3.7 cluster and today got alerted on
disk space usage. We keep the data and commit log directories on separate
EBS volumes. The data volume is 2TB. The node went down due to EBS failure
on the commit log drive. I stopped the instance and was later told by AWS
support that the drive had recovered. I started the node back up and saw
that it couldn't replay commit logs due to corrupted data, so I cleared the
commit logs and then it started up again just fine. I'm not worried about
anything there that wasn't flushed; I can replay that. I was unfortunately
just outside the hinted handoff window, so I decided to run a repair.

Roughly 24 hours after I started the repair is when I got the alert on disk
space. I checked and saw that right before I started the repair the node
was using almost 1TB of space, which is right where all the nodes sit, and
over the course of 24 hours free space had dropped to about 200GB.

My gut reaction was that the repair must have caused this increase, but I'm
not convinced since the disk usage doubled and continues to grow. I figured
we would see at most an increase of 2x the size of an SSTable undergoing
compaction, unless there's more to the disk usage profile of a node during
repair. We use SizeTieredCompactionStrategy on all the tables in this
keyspace.
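
(For concreteness, the intuition is: an STCS compaction writes its merged
output before the input SSTables are deleted, so if it picks, say, four
~250GB SSTables -- hypothetical numbers -- the node can temporarily need up
to another ~1TB of headroom, which is then released when the compaction
finishes. That's the kind of transient spike I expected, not a sustained
climb.)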

Running nodetool compactionstats shows that there are a higher than usual
number of pending compactions (currently 20), and there's been a large one
of 292.82GB moving slowly.

Is it plausible that the repair is the cause of this sudden increase in
disk space usage? Are there any other things I can check that might provide
insight into what happened?

Thanks,
Paul