Re: sstableloader: How much does it actually need?

2020-02-07 Thread Reid Pinchback
Just mulling this based on some code and log digging I was doing while trying 
to have Reaper stay on top of our cluster.

I think maybe the caveat here relates to eventual consistency.  C* doesn’t do 
state changes as distributed transactions.  The assumption here is that RF=3 is 
implying that at any given instant in real time, either the data is visible 
nowhere, or it is visible in 3 places.  That’s a conceptual simplification but 
not a real time invariant when you don’t have a transactional horizon to 
perfectly determine visibility of data.

When you have C* usage antipatterns like a client that is determined to read 
back data that it just wrote, as though there was a session context that 
somehow provided repeatable read guarantees, under the covers in the logs you 
can see C* fighting to do on-the-fly repairs to push through the requested 
level of consistency before responding to the query.  Which means, for some 
period of time, that achieving consistency was still work in flight.

I’ve also read about some boundary screw cases like drift in time resolution 
between servers creating the opportunity for stale data, and repairs I think 
would fix that. I haven’t tested the scenario though, so I’m not sure how real 
the situation is.

Bottom line though, minus repairs, I think having all the nodes is getting you 
all your chances to repair the problems.  And if the data is mutating as you 
are grabbing it, the entire frontier of changes is ‘minus repairs’.  Since 
tokens are distributed somewhat randomly, you don’t know where you need to make 
up the differences after.

That’s about as far as my navel gazing goes on that.

From: manish khandelwal 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, February 7, 2020 at 12:22 AM
To: "user@cassandra.apache.org" 
Subject: Re: sstableloader: How much does it actually need?

Message from External Sender
Yes you will have all the data in two nodes provided there is no mutation drop 
at node level or data is repaired

For example if you data A,B,C and D. with RF=3 and 4 nodes (node1, node2, node3 
and node4)

Data A is in node1, node2 and node3
Data B is in node2, node3, and node4
Data C is in node3, node4 and node1
Data D is in node4, node1 and node2

With this configuration, any two nodes combined will give all the data.


Regards
Manish

On Fri, Feb 7, 2020 at 12:53 AM Voytek Jarnot 
mailto:voytek.jar...@gmail.com>> wrote:
Been thinking about it, and I can't really see how with 4 nodes and RF=3, any 2 
nodes would *not* have all the data; but am more than willing to learn.

On the other thing: that's an attractive option, but in our case, the target 
cluster will likely come into use before the source-cluster data is available 
to load. Seemed to me the safest approach was sstableloader.

Thanks

On Wed, Feb 5, 2020 at 6:56 PM Erick Ramirez 
mailto:flightc...@gmail.com>> wrote:
Unfortunately, there isn't a guarantee that 2 nodes alone will have the full 
copy of data. I'd rather not say "it depends". 

TIP: If the nodes in the target cluster have identical tokens allocated, you 
can just do a straight copy of the sstables node-for-node then do nodetool 
refresh. If the target cluster is already built and you can't assign the same 
tokens then sstableloader is your only option. Cheers!

P.S. No need to apologise for asking questions. That's what we're all here for. 
Just keep them coming. 


Re: sstableloader: How much does it actually need?

2020-02-06 Thread manish khandelwal
Yes you will have all the data in two nodes provided there is no mutation
drop at node level or data is repaired

For example if you data A,B,C and D. with RF=3 and 4 nodes (node1, node2,
node3 and node4)

Data A is in node1, node2 and node3
Data B is in node2, node3, and node4
Data C is in node3, node4 and node1
Data D is in node4, node1 and node2

With this configuration, any *two nodes combined* will give all the data.


Regards
Manish

On Fri, Feb 7, 2020 at 12:53 AM Voytek Jarnot 
wrote:

> Been thinking about it, and I can't really see how with 4 nodes and RF=3,
> any 2 nodes would *not* have all the data; but am more than willing to
> learn.
>
> On the other thing: that's an attractive option, but in our case, the
> target cluster will likely come into use before the source-cluster data is
> available to load. Seemed to me the safest approach was sstableloader.
>
> Thanks
>
> On Wed, Feb 5, 2020 at 6:56 PM Erick Ramirez  wrote:
>
>> Unfortunately, there isn't a guarantee that 2 nodes alone will have the
>> full copy of data. I'd rather not say "it depends". 
>>
>> TIP: If the nodes in the target cluster have identical tokens allocated,
>> you can just do a straight copy of the sstables node-for-node then do 
>> nodetool
>> refresh. If the target cluster is already built and you can't assign the
>> same tokens then sstableloader is your only option. Cheers!
>>
>> P.S. No need to apologise for asking questions. That's what we're all
>> here for. Just keep them coming. 
>>
>>>


Re: sstableloader: How much does it actually need?

2020-02-06 Thread Voytek Jarnot
Been thinking about it, and I can't really see how with 4 nodes and RF=3,
any 2 nodes would *not* have all the data; but am more than willing to
learn.

On the other thing: that's an attractive option, but in our case, the
target cluster will likely come into use before the source-cluster data is
available to load. Seemed to me the safest approach was sstableloader.

Thanks

On Wed, Feb 5, 2020 at 6:56 PM Erick Ramirez  wrote:

> Unfortunately, there isn't a guarantee that 2 nodes alone will have the
> full copy of data. I'd rather not say "it depends". 
>
> TIP: If the nodes in the target cluster have identical tokens allocated,
> you can just do a straight copy of the sstables node-for-node then do nodetool
> refresh. If the target cluster is already built and you can't assign the
> same tokens then sstableloader is your only option. Cheers!
>
> P.S. No need to apologise for asking questions. That's what we're all here
> for. Just keep them coming. 
>
>>


Re: sstableloader: How much does it actually need?

2020-02-05 Thread Erick Ramirez
>
> Another option is the DSE-bulk loader but it will require to convert to
> csv/json (good option if you don't like to play with sstableloader and deal
> to get all the sstables from all the nodes)
> https://docs.datastax.com/en/dsbulk/doc/index.html
>

Thanks, Sergio. The DataStax Bulk Loader was developed for a completely
different use case. It doesn't really make sense to go through trouble of
converting the SSTables to CSV/JSON when you've already got the SSTables to
begin with. ☺

It was really designed for loading/unloading data from non-C* sources as a
replacement for the COPY command. Cheers!


Re: sstableloader: How much does it actually need?

2020-02-05 Thread Dor Laor
Another option is to use the Spark migrator, it reads a source CQL cluster and
writes to another. It has a validation stage that compares a full scan
and reports the diff:
https://github.com/scylladb/scylla-migrator

There are many more ways to clone a cluster. My main recommendation is
to 'optimize'
for correctness and simplicity first and only last optimize for
performance/time. Eventually
machine time for such rare operation is cheap, engineering time is
expensive and data
inconsistency is priceless..

On Wed, Feb 5, 2020 at 5:24 PM Sergio  wrote:
>
> Another option is the DSE-bulk loader but it will require to convert to 
> csv/json (good option if you don't like to play with sstableloader and deal 
> to get all the sstables from all the nodes)
> https://docs.datastax.com/en/dsbulk/doc/index.html
>
> Cheers
>
> Sergio
>
> Il giorno mer 5 feb 2020 alle ore 16:56 Erick Ramirez  
> ha scritto:
>>
>> Unfortunately, there isn't a guarantee that 2 nodes alone will have the full 
>> copy of data. I'd rather not say "it depends".
>>
>> TIP: If the nodes in the target cluster have identical tokens allocated, you 
>> can just do a straight copy of the sstables node-for-node then do nodetool 
>> refresh. If the target cluster is already built and you can't assign the 
>> same tokens then sstableloader is your only option. Cheers!
>>
>> P.S. No need to apologise for asking questions. That's what we're all here 
>> for. Just keep them coming.

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: sstableloader: How much does it actually need?

2020-02-05 Thread Sergio
Another option is the DSE-bulk loader but it will require to convert to
csv/json (good option if you don't like to play with sstableloader and deal
to get all the sstables from all the nodes)
https://docs.datastax.com/en/dsbulk/doc/index.html

Cheers

Sergio

Il giorno mer 5 feb 2020 alle ore 16:56 Erick Ramirez 
ha scritto:

> Unfortunately, there isn't a guarantee that 2 nodes alone will have the
> full copy of data. I'd rather not say "it depends". 
>
> TIP: If the nodes in the target cluster have identical tokens allocated,
> you can just do a straight copy of the sstables node-for-node then do nodetool
> refresh. If the target cluster is already built and you can't assign the
> same tokens then sstableloader is your only option. Cheers!
>
> P.S. No need to apologise for asking questions. That's what we're all here
> for. Just keep them coming. 
>
>>


Re: sstableloader: How much does it actually need?

2020-02-05 Thread Erick Ramirez
Unfortunately, there isn't a guarantee that 2 nodes alone will have the
full copy of data. I'd rather not say "it depends". 

TIP: If the nodes in the target cluster have identical tokens allocated,
you can just do a straight copy of the sstables node-for-node then do nodetool
refresh. If the target cluster is already built and you can't assign the
same tokens then sstableloader is your only option. Cheers!

P.S. No need to apologise for asking questions. That's what we're all here
for. Just keep them coming. 

>


sstableloader: How much does it actually need?

2020-02-05 Thread Voytek Jarnot
Scenario: Cassandra 3.11.x, 4 nodes, RF=3; moving to identically-sized
cluster via snapshots and sstableloader.

As far as I can tell, in the topology given above, any 2 nodes contain all
of the data. In terms of migrating this cluster, would there be any
downsides or risks with snapshotting and loading (sstableloader) only 2 of
the nodes rather than all 4?

Apologies for the spate of hypotheticals lately, this project is making
life interesting.

Thanks,
Voytek Jarnot