Scrub really shouldn’t be required here. 

If a step reports corruption, it's either a very old table where you dropped 
columns (or did something “wrong”) in the past, or it's a software bug. A 
long-ago dropped column should be obvious in the stack trace; anything else 
deserves a bug report.
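
If a step does report corruption, the stack trace in the server log should tell
you which sstable and column are involved. A rough way to check, assuming the
default log location (adjust the path for your install):

  # pull the corruption stack trace and the sstable/column it names
  grep -B2 -A30 'CorruptSSTableException' /var/log/cassandra/system.log

If the trace points at a long-dropped column, that's the known case; anything
else is exactly what I'd want in a bug report.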

It's unfortunate that people jump straight to scrubbing the unreadable data. I'd 
appreciate an anonymized JIRA if possible; alternatively, work with your vendor 
to make sure there are no bugs in their sstable readers.

> On Nov 29, 2019, at 8:58 PM, Shishir Kumar <shishirroy2...@gmail.com> wrote:
> 
> 
> Some more background: we have planned (and tested) a binary upgrade across all 
> nodes without downtime, with upgradesstables as the next step. The sstable file 
> format and version change from format big, version mc to format bti, version aa 
> (refer 
> https://docs.datastax.com/en/dse/6.0/dse-admin/datastax_enterprise/tools/toolsSStables/ToolsSSTableupgrade.html
> - this is an upgrade from DSE 5.1 to 6.x). The underlying format change explains 
> why the upgrade takes so much time.
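> 
> As a rough check on how far the rewrite has progressed, we look at the version 
> prefix in the sstable filenames on disk (the path below assumes the default 
> data directory; adjust for your install):
> 
>   # mc-* files are the old 'big' format, aa-* (bti) files are the new DSE 6 format
>   find /var/lib/cassandra/data -name '*-Data.db' \
>     | awk -F/ '{print $NF}' | cut -d- -f1 | sort | uniq -c
> 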
> We are also considering running upgradesstables in parallel across racks. This 
> is where I am not sure about the impact of running in parallel (the 
> documentation recommends running one node at a time). During upgradesstables 
> there are scenarios where it reports file corruption, which then requires a 
> corrective step, i.e. scrub. Because of the file corruption, nodes at times go 
> down or end up at ~100% CPU usage. Performing the above in parallel without 
> downtime might result in more inconsistency across nodes. We have not tested 
> this scenario, so we would appreciate help from the group if anyone has done a 
> similar upgrade in the past (i.e. which scenarios/complexities need to be 
> considered, and why the guideline recommends running upgradesstables one node 
> at a time).
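> 
> For reference, the rough per-node sequence we run today looks like the sketch 
> below; keyspace/table names are placeholders, and the -j option (number of 
> sstables rewritten concurrently) may differ by version:
> 
>   nodetool upgradesstables -j 2            # rewrite sstables to the new format
>   # only when the rewrite reports a corrupt sstable for a table:
>   nodetool scrub <keyspace> <table>        # online scrub of that table
> 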
> -Shishir
> 
>> On Fri, Nov 29, 2019 at 11:52 PM Josh Snyder <j...@code406.com> wrote:
>> Hello Shishir,
>> 
>> It shouldn't be necessary to take downtime to perform upgrades of a 
>> Cassandra cluster. It sounds like the biggest issue you're facing is the 
>> upgradesstables step. upgradesstables is not strictly necessary before a 
>> Cassandra node re-enters the cluster to serve traffic; in my experience it 
>> is purely for optimizing the performance of the database once the software 
>> upgrade is complete. I recommend trying out an upgrade in a test environment 
>> without using upgradesstables, which should bring the 5 hours per node down 
>> to just a few minutes.
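>> 
>> As a sketch, the per-node software upgrade can look roughly like this (the 
>> service name and install step are placeholders for whatever your environment 
>> uses):
>> 
>>   nodetool drain                 # flush memtables and stop accepting traffic
>>   sudo service cassandra stop    # or the DSE service name in your environment
>>   # install the new binaries via your package manager or tarball
>>   sudo service cassandra start
>>   nodetool status                # wait for the node to show UN before moving on
>>   # defer upgradesstables until every node runs the new version, then run it
>>   # in the background at your leisure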
>> 
>> If you're running NetworkTopologyStrategy and you want to optimize further, 
>> you could consider performing the upgrade on multiple nodes within the same 
>> rack in parallel. When correctly configured, NetworkTopologyStrategy can 
>> protect your database from an outage of an entire rack. So performing an 
>> upgrade on a few nodes at a time within a rack is the same as a partial rack 
>> outage, from the database's perspective.
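>> 
>> A quick sanity check before upgrading a whole rack at once (the keyspace name 
>> here is just an example):
>> 
>>   nodetool status my_keyspace                # shows each node's rack and ownership
>>   cqlsh -e "DESCRIBE KEYSPACE my_keyspace"   # confirm NetworkTopologyStrategy and RF per DC
>> 
>> With RF=3 and at least three racks per DC, each rack should hold at most one 
>> replica of a given partition, so taking one rack's nodes offline together 
>> looks like losing a single replica.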
>> 
>> Have a nice upgrade!
>> 
>> Josh
>> 
>>> On Fri, Nov 29, 2019 at 7:22 AM Shishir Kumar <shishirroy2...@gmail.com> 
>>> wrote:
>>> Hi,
>>> 
>>> I need input on a Cassandra upgrade strategy for the setup below:
>>> 1. We have datacenters across 4 geographies (multiple isolated deployments in 
>>> each DC).
>>> 2. The number of Cassandra nodes in each deployment is between 6 and 24.
>>> 3. The data volume on each node is between 150 and 400 GB.
>>> 4. All production environments have DR set up.
>>> 5. We do not want downtime during the upgrade.
>>> 
>>> We are planning a stack upgrade, but upgradesstables takes approx. 5 hours 
>>> per node (when the data volume is approx. 200 GB). 
>>> Options: 
>>> No downtime - As per the recommendation (DataStax documentation) we would 
>>> upgrade one node at a time, i.e. in sequence; the upgrade cycle for one 
>>> environment would then take weeks, which is a DevOps concern (rough numbers 
>>> are sketched below, after the options).
>>> Read only (no downtime) - Route read-only load to the DR system. We have 
>>> resilience built in to handle mutation scenarios, but if it takes more than, 
>>> say, 3-4 hours, there will be a long catch-up exercise. The maintenance cost 
>>> seems too high because of the unknowns.
>>> Downtime - Upgrade all nodes in parallel while there are no live customers. 
>>> This has a direct customer impact, so we would need to justify maintenance 
>>> cost versus customer impact.
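>>> 
>>> Rough arithmetic behind the "weeks" estimate, taking the largest deployment 
>>> of 24 nodes at ~5 hours of upgradesstables per node (~200 GB each):
>>> 
>>>   24 nodes x 5 h = 120 h of upgradesstables alone, i.e. roughly 5 calendar 
>>>   days per deployment before the binary upgrade itself, validation and any 
>>>   remediation; nodes closer to 400 GB would take roughly twice as long each.
>>> 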
>>> Please suggest how other organisations (with 100+ nodes) are solving this 
>>> scenario.
>>> 
>>> Regards 
>>> Shishir 
>>> 
