Do you think with the setup I've described I'd be ok doing that now to recover this node?

The node died trying to run the scrub; I've restarted it, but I'm not sure it's going to get past a scrub/repair, which is why I deleted the other files as a brute-force method. I think I might have to do the same here and then kick off a repair if I can't just replace it? Doing the repair on the node that had the corrupt data deleted should be ok?
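For reference, the replacement Jeff describes below is driven by a startup flag on the new node. A minimal sketch, assuming a package install where JVM options live in cassandra-env.sh; the dead node's IP is a placeholder:

    # on the fresh replacement node, before first start, add to cassandra-env.sh:
    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<dead_node_ip>"

    # start the node; it takes over the dead node's tokens and restreams
    # its data from the remaining healthy replicas
    sudo service cassandra start

    # once it finishes joining, remove the flag so future restarts are normal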
On Sun, Aug 13, 2017 at 10:29 AM Jeff Jirsa <jji...@gmail.com> wrote:

> Running repairs when you have corrupt sstables can spread the corruption.
>
> In 2.1.15, corruption is almost certainly from something like a bad disk
> or bad RAM.
>
> One way to deal with corruption is to stop the node and replace it (with
> -Dcassandra.replace_address) so you restream data from neighbors. The
> challenge here is making sure you have a healthy replica for streaming.
>
> Please make sure you have backups and snapshots if you have corruption
> popping up.
>
> If you're using vnodes, once you get rid of the corruption you may
> consider adding another C node with fewer vnodes to try to get it joined
> faster with less data.
>
> --
> Jeff Jirsa
>
> On Aug 13, 2017, at 7:11 AM, Brian Spindler <brian.spind...@gmail.com> wrote:
>
> Hi Jeff, I ran the scrub online and that didn't help. I went ahead and
> stopped the node, deleted all the corrupted <cf>-<num>-*.db data files,
> and planned on running a repair when it came back online.
>
> Unrelated, I believe, but now another CF is corrupted!
>
> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
> /ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db
> Caused by: org.apache.cassandra.io.compress.CorruptBlockException:
> (/ephemeral/cassandra/data/OpsCenter/rollups300-45c85324387b35238d056678f8fa8b0f/OpsCenter-rollups300-ka-100672-Data.db):
> corruption detected, chunk at 101500 of length 26523398.
>
> A few days ago when troubleshooting this I changed the OpsCenter keyspace
> to RF=2 from 3, since I thought that would help reduce load. Did that
> cause this corruption?
>
> Running 'nodetool scrub OpsCenter rollups300' on that node now.
>
> And now I also see this when running nodetool status:
>
> "Note: Non-system keyspaces don't have the same replication settings,
> effective ownership information is meaningless"
>
> What to do?
>
> I still can't stream to this new node because of this corruption. Disk
> space is getting low on these nodes ...
>
> On Sat, Aug 12, 2017 at 9:51 PM Brian Spindler <brian.spind...@gmail.com> wrote:
>
>> Nothing in logs on the node that it was streaming from.
>>
>> However, I think I found the issue on the other node in the C rack:
>>
>> ERROR [STREAM-IN-/10.40.17.114] 2017-08-12 16:48:53,354
>> StreamSession.java:512 - [Stream #08957970-7f7e-11e7-b2a2-a31e21b877e5]
>> Streaming error occurred
>> org.apache.cassandra.io.sstable.CorruptSSTableException: Corrupted:
>> /ephemeral/cassandra/data/...
>>
>> I did a 'cat /var/log/cassandra/system.log|grep Corrupt' and it seems
>> it's a single Index.db file, and nothing on the other node.
>>
>> I think nodetool scrub or offline sstablescrub might be in order, but
>> with the current load I'm not sure I can take it offline for very long.
>>
>> Thanks again for the help.
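For reference, the offline variant mentioned above runs against a stopped node. A minimal sketch, using the keyspace/table from the errors above and assuming the stock tool is on the PATH:

    # Cassandra must be stopped on this node for an offline scrub
    sudo service cassandra stop

    # rewrites the sstables for that table, skipping rows it can't read
    sstablescrub OpsCenter rollups300

    sudo service cassandra start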
>> On Sat, Aug 12, 2017 at 9:38 PM Jeffrey Jirsa <jji...@gmail.com> wrote:
>>
>>> Compaction is backed up - that may be normal write load (because of the
>>> rack imbalance), or it may be a secondary index build. Hard to say for
>>> sure. 'nodetool compactionstats' if you're able to provide it. The jstack
>>> is probably not necessary; streaming is being marked as failed and it's
>>> turning itself off. Not sure why streaming is marked as failing, though -
>>> anything on the sending sides?
>>>
>>> From: Brian Spindler <brian.spind...@gmail.com>
>>> Reply-To: <user@cassandra.apache.org>
>>> Date: Saturday, August 12, 2017 at 6:34 PM
>>> To: <user@cassandra.apache.org>
>>> Subject: Re: Dropping down replication factor
>>>
>>> Thanks for replying Jeff.
>>>
>>> Responses below.
>>>
>>> On Sat, Aug 12, 2017 at 8:33 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> Answers inline
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>> > On Aug 12, 2017, at 2:58 PM, brian.spind...@gmail.com wrote:
>>>> >
>>>> > Hi folks, hopefully a quick one:
>>>> >
>>>> > We are running a 12-node cluster (2.1.15) in AWS with Ec2Snitch.
>>>> > It's all in one region but spread across 3 availability zones. It was
>>>> > nicely balanced with 4 nodes in each.
>>>> >
>>>> > But with a couple of failures and subsequent provisions to the wrong
>>>> > AZ, we now have a cluster with:
>>>> >
>>>> > 5 nodes in az A
>>>> > 5 nodes in az B
>>>> > 2 nodes in az C
>>>> >
>>>> > Not sure why, but when adding a third node in AZ C it fails to stream
>>>> > after getting all the way to completion, with no apparent error in the
>>>> > logs. I've looked at a couple of bugs referring to scrubbing and
>>>> > possible OOM bugs due to metadata writing at the end of streaming
>>>> > (sorry, don't have the ticket handy). I'm worried I might not be able
>>>> > to do much with these since disk space usage is high and they are
>>>> > under a lot of load given the small number of them for this rack.
>>>>
>>>> You'll definitely have higher load on az C instances with rf=3 in this
>>>> ratio.
>>>>
>>>> Streaming should still work - are you sure it's not busy doing
>>>> something? Like building a secondary index or similar? A jstack thread
>>>> dump would be useful, or at least nodetool tpstats.
>>>
>>> Only other thing might be a backup. We do incrementals x1hr and
>>> snapshots x24h; they are shipped to S3, then links are cleaned up.
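For reference on that cadence: the hourly incrementals and daily snapshots map to Cassandra's built-in mechanisms, with the S3 shipping done by an external script. A rough sketch, with the config path as an assumption:

    # incrementals are toggled in cassandra.yaml (incremental_backups: true);
    # hardlinks then accumulate under <data_dir>/<keyspace>/<table>/backups/
    grep incremental_backups /etc/cassandra/cassandra.yaml

    # a daily tagged snapshot, e.g. from cron; hardlinks land under
    # <data_dir>/<keyspace>/<table>/snapshots/<tag>/
    nodetool snapshot -t daily-$(date +%F)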
>>> The error I get on the node I'm trying to add to rack C is:
>>>
>>> ERROR [main] 2017-08-12 23:54:51,546 CassandraDaemon.java:583 - Exception encountered during startup
>>> java.lang.RuntimeException: Error during boostrap: Stream failed
>>>     at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:87) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:1166) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:944) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:740) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:617) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:391) [apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:566) [apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:655) [apache-cassandra-2.1.15.jar:2.1.15]
>>> Caused by: org.apache.cassandra.streaming.StreamException: Stream failed
>>>     at org.apache.cassandra.streaming.management.StreamEventJMXNotifier.onFailure(StreamEventJMXNotifier.java:85) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at com.google.common.util.concurrent.Futures$4.run(Futures.java:1172) ~[guava-16.0.jar:na]
>>>     at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) ~[guava-16.0.jar:na]
>>>     at com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156) ~[guava-16.0.jar:na]
>>>     at com.google.common.util.concurrent.ExecutionList.execute(ExecutionList.java:145) ~[guava-16.0.jar:na]
>>>     at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:202) ~[guava-16.0.jar:na]
>>>     at org.apache.cassandra.streaming.StreamResultFuture.maybeComplete(StreamResultFuture.java:209) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.streaming.StreamResultFuture.handleSessionComplete(StreamResultFuture.java:185) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.streaming.StreamSession.closeSession(StreamSession.java:413) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.streaming.StreamSession.maybeCompleted(StreamSession.java:700) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.streaming.StreamSession.taskCompleted(StreamSession.java:661) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:179) ~[apache-cassandra-2.1.15.jar:2.1.15]
>>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_112]
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_112]
>>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_112]
>>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) ~[na:1.8.0_112]
>>>     at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_112]
>>> WARN  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582 Gossiper.java:1462 - No local state or state is in silent shutdown, not announcing shutdown
>>> INFO  [StorageServiceShutdownHook] 2017-08-12 23:54:51,582 MessagingService.java:734 - Waiting for messaging service to quiesce
>>> INFO  [ACCEPT-/10.40.17.114] 2017-08-12 23:54:51,583 MessagingService.java:1020 - MessagingService has terminated the accept() thread
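An aside: on 2.1 a bootstrap that fails like this generally leaves the partially streamed data in place, so the usual retry is from a clean slate. A rough sketch, using the data path from the logs above; the commitlog/saved_caches paths are placeholders to be read out of cassandra.yaml:

    # on the node whose bootstrap failed
    sudo service cassandra stop

    # clear partially streamed state before retrying the join
    rm -rf /ephemeral/cassandra/data/*
    rm -rf <commitlog_directory>/* <saved_caches_directory>/*

    # restart to re-bootstrap from scratch
    sudo service cassandra start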
>>> And I got this on this same node when it was bootstrapping; I ran
>>> 'nodetool netstats' just before it shut down:
>>>
>>>     Receiving 377 files, 161928296443 bytes total. Already received 377 files, 161928296443 bytes total
>>>
>>> tpstats on the host that was streaming the data to this node:
>>>
>>> Pool Name                    Active   Pending     Completed   Blocked  All time blocked
>>> MutationStage                     1         1    4488289014         0                 0
>>> ReadStage                         0         0      24486526         0                 0
>>> RequestResponseStage              0         0    3038847374         0                 0
>>> ReadRepairStage                   0         0       1601576         0                 0
>>> CounterMutationStage              0         0         68403         0                 0
>>> MiscStage                         0         0             0         0                 0
>>> AntiEntropySessions               0         0             0         0                 0
>>> HintedHandoff                     0         0            18         0                 0
>>> GossipStage                       0         0       2786892         0                 0
>>> CacheCleanupExecutor              0         0             0         0                 0
>>> InternalResponseStage             0         0         61115         0                 0
>>> CommitLogArchiver                 0         0             0         0                 0
>>> CompactionExecutor                4        83        304167         0                 0
>>> ValidationExecutor                0         0         78249         0                 0
>>> MigrationStage                    0         0         94201         0                 0
>>> AntiEntropyStage                  0         0        160505         0                 0
>>> PendingRangeCalculator            0         0            30         0                 0
>>> Sampler                           0         0             0         0                 0
>>> MemtableFlushWriter               0         0         71270         0                 0
>>> MemtablePostFlush                 0         0        175209         0                 0
>>> MemtableReclaimMemory             0         0         81222         0                 0
>>> Native-Transport-Requests         2         0    1983565628         0           9405444
>>>
>>> Message type           Dropped
>>> READ                       218
>>> RANGE_SLICE                 15
>>> _TRACE                       0
>>> MUTATION               2949001
>>> COUNTER_MUTATION             0
>>> BINARY                       0
>>> REQUEST_RESPONSE             0
>>> PAGED_RANGE                  0
>>> READ_REPAIR               8571
>>>
>>> I can get a jstack if needed.
>>>
>>>> > Rather than troubleshoot this further, what I was thinking about
>>>> > doing was:
>>>> > - drop the replication factor on our keyspace to two
>>>>
>>>> Repair before you do this, or you'll lose your consistency guarantees.
>>>
>>> Given the load on the 2 nodes in rack C, I'm hoping a repair will succeed.
>>>
>>>> > - hopefully this would reduce load on these two remaining nodes
>>>>
>>>> It should. Rack awareness guarantees one replica per rack if rf == num
>>>> racks, so right now those 2 C machines have 2.5x as much data as the
>>>> others. This will drop that requirement and drop the load significantly.
>>>>
>>>> > - run repairs/cleanup across the cluster
>>>> > - then shoot these two nodes in the 'c' rack
>>>>
>>>> Why shoot the C instances? Why not drop RF and then add 2 more C
>>>> instances, then increase RF back to 3, run repair, then decom the extra
>>>> instances in A and B?
>>>
>>> Fair point. I was considering staying at RF two, but I think with your
>>> points below, I should reconsider.
>>>
>>>> > - run repairs/cleanup across the cluster
>>>> >
>>>> > Would this work with minimal/no disruption?
>>>>
>>>> The big risk of running rf=2 is that quorum == all: any GC pause or
>>>> node restarting will make you lose HA or strong consistency guarantees.
>>>>
>>>> > Should I update their "rack" beforehand or after?
>>>>
>>>> You can't change a node's rack once it's in the cluster; it SHOULD
>>>> refuse to start if you do that.
>>>
>>> Got it.
>>>
>>>> > What else am I not thinking about?
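For reference, the RF moves discussed above are plain schema changes followed by repair and cleanup. A minimal sketch, with the keyspace name and the Ec2Snitch datacenter (region) name as placeholders:

    # lower (or later raise) the RF; with Ec2Snitch the DC name is the region
    cqlsh -e "ALTER KEYSPACE <keyspace> WITH replication =
      {'class': 'NetworkTopologyStrategy', '<region>': 2};"

    # run on every node after raising RF (and before lowering it)
    nodetool repair -pr <keyspace>

    # once topology/RF is settled, drop data each node no longer owns
    nodetool cleanup <keyspace>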
>>>> > My main goal atm is to get back to where the cluster is in a clean,
>>>> > consistent state that allows nodes to properly bootstrap.
>>>> >
>>>> > Thanks for your help in advance.