Re: sstableloader: How much does it actually need?
Just mulling this over, based on some code and log digging I did while trying to get Reaper to stay on top of our cluster. I think the caveat here relates to eventual consistency. C* doesn’t apply state changes as distributed transactions. The assumption here is that RF=3 implies that, at any given instant in real time, the data is either visible nowhere or visible in 3 places. That’s a conceptual simplification, not a real-time invariant, because there is no transactional horizon to perfectly determine the visibility of data.

When you have C* usage antipatterns, like a client that is determined to read back data it just wrote as though there were a session context providing repeatable-read guarantees, you can see in the logs C* fighting to do on-the-fly repairs to reach the requested level of consistency before responding to the query. That means that, for some period of time, achieving consistency was still work in flight.

I’ve also read about some edge cases, like time drift between servers creating the opportunity for stale data, and I think repairs would fix that. I haven’t tested the scenario though, so I’m not sure how real the situation is.

Bottom line though: minus repairs, I think having all the nodes gets you all your chances to repair the problems. And if the data is mutating as you are grabbing it, the entire frontier of changes is ‘minus repairs’. Since tokens are distributed somewhat randomly, you don’t know where you would need to make up the differences afterwards.

That’s about as far as my navel gazing goes on that.
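To make the "minus repairs" concern concrete, here is a minimal Python sketch. It assumes a last-write-wins merge by write timestamp, and the node names, timestamps and values are all invented for illustration. It models one key on a 4-node, RF=3 cluster where one replica missed the latest write, then checks which two-node subsets would produce a stale result if only their data were loaded.

```python
from itertools import combinations

# Hypothetical state of a single key. Each replica holds
# (write_timestamp, value); None means the node is not a replica for this key.
replicas = {
    "node1": (100, "old"),  # missed the latest write and was never repaired
    "node2": (200, "new"),
    "node3": (200, "new"),
    "node4": None,          # does not own this key
}


def merged_value(nodes):
    """Last-write-wins merge of whatever the chosen nodes hold for the key."""
    cells = [replicas[n] for n in nodes if replicas[n] is not None]
    return max(cells)[1] if cells else None


# Value you get when data from every node is merged.
all_nodes = merged_value(replicas)

# Two-node subsets whose merged value differs from the full merge.
stale_pairs = [
    pair
    for pair in combinations(sorted(replicas), 2)
    if merged_value(pair) != all_nodes
]

print(all_nodes)
print(stale_pairs)
```

In this toy model only the pair made up of the un-repaired replica and the non-replica node comes out stale, which is the kind of gap that loading from all nodes (or repairing first) would close.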
Re: sstableloader: How much does it actually need?
Yes, you will have all the data in two nodes, provided there is no mutation drop at node level or the data is repaired.

For example, if you have data A, B, C and D, with RF=3 and 4 nodes (node1, node2, node3 and node4):

Data A is in node1, node2 and node3
Data B is in node2, node3 and node4
Data C is in node3, node4 and node1
Data D is in node4, node1 and node2

With this configuration, any *two nodes combined* will give all the data.

Regards
Manish
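The placement example above can be checked mechanically. The following Python sketch assumes a simplified ring in which each item lands on three consecutive nodes; the helper names are made up for illustration and this is not how Cassandra computes replicas internally. The counting argument behind it: each item is absent from exactly 4 - 3 = 1 node, so any two nodes always include at least one replica of it.

```python
from itertools import combinations

NODES = ["node1", "node2", "node3", "node4"]
RF = 3


def replicas_for(index):
    """Simplified placement: RF consecutive nodes starting at position index."""
    return {NODES[(index + i) % len(NODES)] for i in range(RF)}


# Data A..D placed as in the example: A starts at node1, B at node2, and so on.
placement = {name: replicas_for(i) for i, name in enumerate("ABCD")}


def pair_covers_everything(pair):
    """True if every item has at least one replica on one of the two nodes."""
    return all(owners & set(pair) for owners in placement.values())


# Two-node combinations that would miss at least one item.
uncovered = [p for p in combinations(NODES, 2) if not pair_covers_everything(p)]

print(placement)
print(uncovered)
```

This only covers placement. It says nothing about replicas that missed a write, which is the "no mutation drop or data is repaired" condition stated above.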
Re: sstableloader: How much does it actually need?
Been thinking about it, and I can't really see how with 4 nodes and RF=3, any 2 nodes would *not* have all the data; but am more than willing to learn.

On the other thing: that's an attractive option, but in our case, the target cluster will likely come into use before the source-cluster data is available to load. Seemed to me the safest approach was sstableloader.

Thanks
Re: sstableloader: How much does it actually need?
> Another option is the DSE-bulk loader but it will require to convert to
> csv/json (good option if you don't like to play with sstableloader and deal
> to get all the sstables from all the nodes)
> https://docs.datastax.com/en/dsbulk/doc/index.html

Thanks, Sergio. The DataStax Bulk Loader was developed for a completely different use case. It doesn't really make sense to go through the trouble of converting the SSTables to CSV/JSON when you've already got the SSTables to begin with. ☺ It was really designed for loading/unloading data from non-C* sources as a replacement for the COPY command. Cheers!
Re: sstableloader: How much does it actually need?
Another option is to use the Spark migrator, which reads from a source CQL cluster and writes to another. It has a validation stage that compares a full scan and reports the diff: https://github.com/scylladb/scylla-migrator

There are many more ways to clone a cluster. My main recommendation is to 'optimize' for correctness and simplicity first, and only last for performance/time. In the end, machine time for such a rare operation is cheap, engineering time is expensive, and data inconsistency is priceless.
Re: sstableloader: How much does it actually need?
Another option is the DSE bulk loader, but it requires converting the data to CSV/JSON (a good option if you don't want to work with sstableloader and deal with collecting all the sstables from all the nodes): https://docs.datastax.com/en/dsbulk/doc/index.html

Cheers
Sergio
Re: sstableloader: How much does it actually need?
Unfortunately, there isn't a guarantee that 2 nodes alone will have the full copy of data. I'd rather not say "it depends".

TIP: If the nodes in the target cluster have identical tokens allocated, you can just do a straight copy of the sstables node-for-node then do nodetool refresh. If the target cluster is already built and you can't assign the same tokens then sstableloader is your only option. Cheers!

P.S. No need to apologise for asking questions. That's what we're all here for. Just keep them coming.
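The decision in the TIP can be written down as a small Python sketch. Everything in it is hypothetical: the function name, the node names and the token values are invented, and gathering each node's real token list is left out. It pairs each source node with a target node that owns exactly the same tokens. If every source node finds such a match, a node-for-node copy followed by nodetool refresh is possible; otherwise it falls back to sstableloader.

```python
def choose_load_method(source_tokens, target_tokens):
    """Pick a load method from per-node token assignments.

    Both arguments are hypothetical dicts of node name -> iterable of tokens.
    Returns (method, mapping of source node -> target node).
    """
    target_by_tokens = {
        frozenset(tokens): node for node, tokens in target_tokens.items()
    }
    mapping = {}
    for node, tokens in source_tokens.items():
        match = target_by_tokens.get(frozenset(tokens))
        if match is None:
            # At least one source node has no token-identical counterpart.
            return "sstableloader", {}
        mapping[node] = match
    return "copy + nodetool refresh", mapping


# Made-up tokens, for illustration only.
source = {"src1": [-100, 50], "src2": [-20, 90]}
same = {"dst1": [-20, 90], "dst2": [50, -100]}
different = {"dst1": [-30, 90], "dst2": [50, -100]}

print(choose_load_method(source, same))
print(choose_load_method(source, different))
```

The returned mapping tells you which target node should receive each source node's sstables when the straight-copy route applies.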
sstableloader: How much does it actually need?
Scenario: Cassandra 3.11.x, 4 nodes, RF=3; moving to an identically-sized cluster via snapshots and sstableloader.

As far as I can tell, in the topology given above, any 2 nodes contain all of the data. In terms of migrating this cluster, would there be any downsides or risks with snapshotting and loading (sstableloader) only 2 of the nodes rather than all 4?

Apologies for the spate of hypotheticals lately, this project is making life interesting.

Thanks,
Voytek Jarnot