Scrub really shouldn’t be required here. If a step ever reports corruption, it’s either a very old table where you previously dropped columns (or did something else “wrong” in the past), or it’s a software bug. The old dropped column case should be obvious from the stack trace; anything else deserves a bug report.
It’s unfortunate that people jump straight to scrubbing the unreadable data. I would appreciate an anonymized JIRA if possible. Alternatively, work with your vendor to make sure they don’t have bugs in their readers.

> On Nov 29, 2019, at 8:58 PM, Shishir Kumar <shishirroy2...@gmail.com> wrote:
>
> Some more background. We are planning (and have tested) a binary upgrade
> across all nodes without downtime. The next step is running upgradesstables.
> The C* file format and version change (from format big, version mc to format
> bti, version aa; refer to
> https://docs.datastax.com/en/dse/6.0/dse-admin/datastax_enterprise/tools/toolsSStables/ToolsSSTableupgrade.html
> - upgrade from DSE 5.1 to 6.x). These underlying changes explain why it takes
> so much time to upgrade.
>
> Running upgradesstables in parallel across RAC - this is where I am not sure
> of the impact of running in parallel (the document recommends running one node
> at a time). During upgradesstables there are scenarios where it reports file
> corruption and hence requires a corrective step, i.e. scrub. At times nodes go
> down due to sstable corruption, or it results in high CPU usage (~100%).
> Performing the above in parallel without downtime might result in more
> inconsistency across nodes. We have not tested this scenario, so we will need
> the group's help in case they have done a similar upgrade in the past (i.e.
> the scenarios/complexity that need to be considered, and why the guideline
> recommends running upgradesstables one node at a time).
>
> -Shishir
>
>> On Fri, Nov 29, 2019 at 11:52 PM Josh Snyder <j...@code406.com> wrote:
>> Hello Shishir,
>>
>> It shouldn't be necessary to take downtime to perform upgrades of a
>> Cassandra cluster. It sounds like the biggest issue you're facing is the
>> upgradesstables step. upgradesstables is not strictly necessary before a
>> Cassandra node re-enters the cluster to serve traffic; in my experience it
>> is purely for optimizing the performance of the database once the software
>> upgrade is complete.
>> I recommend trying out an upgrade in a test environment
>> without using upgradesstables, which should bring the 5 hours per node down
>> to just a few minutes.
>>
>> If you're running NetworkTopologyStrategy and you want to optimize further,
>> you could consider performing the upgrade on multiple nodes within the same
>> rack in parallel. When correctly configured, NetworkTopologyStrategy can
>> protect your database from an outage of an entire rack. So performing an
>> upgrade on a few nodes at a time within a rack is the same as a partial rack
>> outage, from the database's perspective.
>>
>> Have a nice upgrade!
>>
>> Josh
>>
>>> On Fri, Nov 29, 2019 at 7:22 AM Shishir Kumar <shishirroy2...@gmail.com>
>>> wrote:
>>> Hi,
>>>
>>> Need input on a Cassandra upgrade strategy for the following:
>>> 1. We have datacenters across 4 geographies (multiple isolated deployments
>>> in each DC).
>>> 2. The number of Cassandra nodes in each deployment is between 6 and 24.
>>> 3. Data volume on each node is between 150 and 400 GB.
>>> 4. All production environments have DR set up.
>>> 5. During the upgrade we do not want downtime.
>>>
>>> We are planning to go for a stack upgrade, but upgradesstables is taking
>>> approx. 5 hours per node (if data volume is approx. 200 GB).
>>> Options:
>>> No downtime - As per the recommendation (DataStax documentation), if we
>>> upgrade one node at a time, i.e. in sequence, the upgrade cycle for one
>>> environment will take weeks, which is a DevOps concern.
>>> Read only (no downtime) - Route read-only load to the DR system. We have
>>> resilience built in to take care of mutation scenarios. But in case it takes
>>> more than, say, 3-4 hours, there will be a long catch-up exercise. The
>>> maintenance cost seems too high due to unknowns.
>>> Downtime - Can upgrade all nodes in parallel as there are no live customers.
>>> This has direct customer impact, so we need to make the case on maintenance
>>> cost vs. customer impact.
>>> Please suggest how other organisations (those with 100+ nodes) are solving
>>> this scenario.
>>>
>>> Regards
>>> Shishir
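
To put rough numbers on the rack-by-rack idea Josh describes in the quoted thread, here is a minimal planning sketch. It only does arithmetic and does not touch a cluster. The node names, the three-rack layout, and the batch size of 4 are made up for illustration; the 5 hours per node is the figure Shishir quoted for an upgrade that includes upgradesstables. It assumes NetworkTopologyStrategy is correctly configured, so that taking down a few nodes in a single rack looks like a partial rack outage.

```python
from collections import defaultdict


def plan_batches(nodes, batch_size):
    """Group nodes by rack, then split each rack into batches.

    nodes: iterable of (node_name, rack_name) pairs (hypothetical names).
    Returns a list of (rack, [node, ...]) batches; a batch never spans racks.
    """
    by_rack = defaultdict(list)
    for name, rack in nodes:
        by_rack[rack].append(name)

    batches = []
    for rack in sorted(by_rack):
        members = by_rack[rack]
        for start in range(0, len(members), batch_size):
            batches.append((rack, members[start:start + batch_size]))
    return batches


def estimate_hours(batches, hours_per_node):
    # Nodes inside a batch are upgraded in parallel; batches run in sequence.
    return len(batches) * hours_per_node


# Illustrative layout: 24 nodes spread evenly over 3 racks (an assumption).
nodes = [(f"node{i}", f"rack{i % 3}") for i in range(24)]
batches = plan_batches(nodes, batch_size=4)

sequential_hours = len(nodes) * 5            # one node at a time
batched_hours = estimate_hours(batches, 5)   # four nodes of one rack at a time

for rack, members in batches:
    print(rack, members)
print("sequential:", sequential_hours, "hours; batched:", batched_hours, "hours")
```

With these made-up inputs the plan has 6 batches, so roughly 30 hours instead of 120 hours for one node at a time. If upgradesstables is deferred until after the binary upgrade, as Josh suggests, the per-node figure would be much smaller than 5 hours, and the same function can be reused with that value.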