Impact of a large number of components in column key/cluster key

2019-08-06 Thread Carl Mueller
Say there are one vs. three vs. five vs. eight parts in a column key.

Will range slicing slow down the more parts there are? Will compactions be
impacted?
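For concreteness, here is a hypothetical table with a five-component primary key (one partition key plus four clustering columns); all names and values are invented:

```sql
-- Hypothetical example: 1 partition key component + 4 clustering columns.
-- Range slices must restrict clustering columns in declared order
-- (day, then hour, then seq here), so deeper keys mean more comparator
-- work per cell, though row count and size usually dominate.
CREATE TABLE ts.readings (
    sensor_id uuid,
    day       date,
    hour      int,
    seq       bigint,
    value     double,
    PRIMARY KEY ((sensor_id), day, hour, seq)
);

-- A range slice over part of one partition:
SELECT value FROM ts.readings
 WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
   AND day = '2019-08-06' AND hour >= 9 AND hour < 12;
```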


Re: Differing snitches in different datacenters

2019-08-06 Thread Jean Carlo
Hello Voytek,

In my opinion, it would be better for you to continue using
GossipingPropertyFileSnitch in AWS as well, to avoid surprises. I've set
up datacenters in AWS using GossipingPropertyFileSnitch with zero
problems.
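For reference, GossipingPropertyFileSnitch only needs a small cassandra-rackdc.properties on each node; the DC and rack names below are placeholders:

```
# cassandra-rackdc.properties (one per node; names here are examples)
dc=aws-us-east-1
rack=rack1
# prefer_local=true   # optional: prefer the private IP for intra-DC traffic
```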



Jean Carlo

"The best way to predict the future is to invent it" Alan Kay


On Wed, Jul 31, 2019 at 9:06 PM Voytek Jarnot 
wrote:

> Thanks Paul. Yes - finding a definitive answer is where I'm failing as
> well. I think we're probably going to try it and see what happens, but
> that's a bit worrisome.
>
> On Mon, Jul 29, 2019 at 3:35 PM Paul Chandler  wrote:
>
>> Hi Voytek,
>>
>> I looked into this a little while ago, and couldn’t really find a
>> definitive answer. We ended up keeping the GossipingPropertyFileSnitch in
>> our GCP Datacenter, the only downside that I could see is that you have to
>> manually specify the rack and DC. But doing it that way does allow you to
>> create a multi-vendor cluster in the future if you wished.
>>
>> I would also be interested if anyone has the definitive answer on this.
>>
>> Thanks
>>
>> Paul
>> www.redshots.com
>>
>> On 29 Jul 2019, at 17:06, Voytek Jarnot  wrote:
>>
>> Just a quick bump - hoping someone can shed some light on whether running
>> different snitches in different datacenters is a terrible idea or not. It'd
>> be fairly temporary: once the new DC is stood up and nodes are rebuilt, the
>> old DC will be decommissioned.
>>
>> On Thu, Jul 25, 2019 at 12:36 PM Voytek Jarnot 
>> wrote:
>>
>>> Quick and hopefully easy question for the list. Background is existing
>>> cluster (1 DC) will be migrated to AWS-hosted cluster via standing up a
>>> second datacenter, existing cluster will be subsequently decommissioned.
>>>
>>> We currently use GossipingPropertyFileSnitch and are thinking about
>>> using Ec2MultiRegionSnitch in the new AWS DC - that'd position us nicely if
>>> in the future we want to run a multi-DC cluster in AWS. My question is: are
>>> there any issues with one DC using GossipingPropertyFileSnitch and the
>>> other using Ec2MultiRegionSnitch? This setup would be temporary, existing
>>> until the new DC nodes have rebuilt and the old DC is decommissioned.
>>>
>>> Thanks,
>>> Voytek Jarnot
>>>
>>
>>


Re: [EXTERNAL] Re: loading big amount of data to Cassandra

2019-08-06 Thread Amanda Moran
With the DataStax bulk loader you can only export from a Cassandra table,
not import into one (it only loads into a DSE cluster).

And +1 on the confusing name of batches ... yes, it's for writes, but not
for bulk loading data.

Amanda 

> On Aug 5, 2019, at 8:14 AM, Durity, Sean R  
> wrote:
> 
> DataStax has a very fast bulk load tool - dsebulk. Not sure if it is 
> available for open source or not. In my experience so far, I am very 
> impressed with it.
> 
> 
> 
> Sean Durity – Staff Systems Engineer, Cassandra
> 
> -Original Message-
> From: p...@xvalheru.org 
> Sent: Saturday, August 3, 2019 6:06 AM
> To: user@cassandra.apache.org
> Cc: Dimo Velev 
> Subject: [EXTERNAL] Re: loading big amount of data to Cassandra
> 
> Thanks to all,
> 
> I'll try the SSTables.
> 
> Thanks
> 
> Pat
> 
>> On 2019-08-03 09:54, Dimo Velev wrote:
>> Check out the CQLSSTableWriter java class -
>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
>> You use it to generate sstables - you need to write a small program
>> for that. You can then stream them over the network using the
>> sstableloader (either use the utility or use the underlying classes to
>> embed it in your program).
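A minimal sketch of the CQLSSTableWriter approach described above; it assumes the cassandra-all dependency is on the classpath, and the keyspace, table, columns, paths, and the `sourceRows` data source are all invented:

```java
// Sketch only: requires the Cassandra libraries; schema is made up.
String schema = "CREATE TABLE ks.events (id uuid PRIMARY KEY, payload text)";
String insert = "INSERT INTO ks.events (id, payload) VALUES (?, ?)";

CQLSSTableWriter writer = CQLSSTableWriter.builder()
        .inDirectory("/tmp/sstables/ks/events")   // output directory
        .forTable(schema)
        .using(insert)
        .build();

for (Row r : sourceRows) {                        // your own data source
    writer.addRow(r.id(), r.payload());
}
writer.close();
// Then stream the generated files into the cluster, e.g.:
//   sstableloader -d <contact-point> /tmp/sstables/ks/events
```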
>> 
>>> On 3. Aug 2019, at 07:17, Ayub M  wrote:
>>> 
>>> Dimo, how do you generate sstables? Do you mean load data locally on
>>> a cassandra node and use sstableloader?
>>> 
>>> On Fri, Aug 2, 2019, 5:48 PM Dimo Velev 
>>> wrote:
>>> 
 Hi,
 
 Batches will actually slow down the process because they mean a
 different thing in C* - as you read they are just grouping changes
 together that you want executed atomically.
 
 Cassandra does not really have indices, so that is different from a
 relational DB. However, writes in Cassandra produce many smallish
 SSTables on disk. These are then compacted together in the background
 to improve read performance.
 
 You have two options from my experience:
 
 Option 1: use normal CQL api in async mode. This will create a
 high CPU load on your cluster. Depending on whether that is fine
 for you that might be the easiest solution.
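The usual way to keep Option 1 from overrunning the cluster is to bound the number of in-flight async writes. The sketch below shows only the back-pressure pattern, with a counter standing in for session.executeAsync(); class and method names are illustrative, not from any driver:

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedAsyncWriter {

    // Issues totalRows "writes", never allowing more than maxInFlight
    // to be outstanding at once; returns the number completed.
    public static int writeAll(int totalRows, int maxInFlight) throws InterruptedException {
        Semaphore permits = new Semaphore(maxInFlight);
        ExecutorService pool = Executors.newFixedThreadPool(8);
        AtomicInteger written = new AtomicInteger();
        for (int i = 0; i < totalRows; i++) {
            permits.acquire(); // blocks once maxInFlight requests are outstanding
            CompletableFuture
                    .runAsync(written::incrementAndGet, pool) // stand-in for session.executeAsync(...)
                    .whenComplete((v, err) -> permits.release());
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return written.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(writeAll(10_000, 128)); // prints 10000
    }
}
```

With a real driver you would release the permit in the write future's completion callback, exactly as the `whenComplete` does here.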
 
 Option 2: generate sstables locally and use the sstableloader to
 upload them into the cluster. The streaming does not generate high
 cpu load so it is a viable option for clusters with other
 operational load.
 
 Option 2 scales with the number of cores of the machine generating
 the sstables. If you can split your data you can generate sstables
 on multiple machines. In contrast, option 1 scales with your
 cluster. If you have a large cluster that is idling, it would be
 better to use option 1.
 
 With both options I was able to write at about 50-100K rows / sec
 on my laptop and local Cassandra. The speed heavily depends on the
 size of your rows.
 
 Back to your question: I guess option 2 is similar to what you
 are used to from tools like sqlloader for relational DBMSes.
 
 I had a requirement of loading a few hundred million rows per day into
 an operational cluster, so I went with option 2 to offload the CPU
 load and reduce the impact on the reading side during the loads.
 
 Cheers,
 Dimo
 
 Sent from my iPad
 
> On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
> 
> Hi,
> 
> I need to upload about 7 billion records to Cassandra. What
 is the best setup of Cassandra for this task? Will using batches
 speed up the upload (I've read somewhere that batches in Cassandra
 are meant for atomicity, not for speeding up communication)? How
 does Cassandra handle indexing internally? With SQL databases, when
 uploading this much data it is suggested to turn indexing off and
 back on afterwards. Is something similar possible in Cassandra?
> 
> Thanks for all suggestions.
> 
> Pat
> 
> 
> Freehosting PIPNI - http://www.pipni.cz/
> 
> 
> 
 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org

Re: Rebuilding a node without clients hitting it

2019-08-06 Thread Alexander Dejanovski
Hi Cyril,

it will depend on the load balancing policy that is used in the client code.

If you're only accessing DC1, with the node being rebuilt living in DC2,
then you need your clients to use the DCAwareRoundRobinPolicy to
restrict connections to DC1 and keep any queries from hitting DC2.
If clients are accessing both datacenters and you're not using the
TokenAwarePolicy, the coordinator could pick the node being rebuilt to
process the query, even with LOCAL_ONE.
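With the DataStax Java driver 3.x, pinning clients to DC1 looks roughly like the sketch below (the contact point and DC name are placeholders, and the driver jar is assumed on the classpath):

```java
// Sketch: token-aware routing wrapped around a DC-restricted policy.
Cluster cluster = Cluster.builder()
        .addContactPoint("10.0.0.1")              // a DC1 node
        .withLoadBalancingPolicy(new TokenAwarePolicy(
                DCAwareRoundRobinPolicy.builder()
                        .withLocalDc("DC1")
                        .build()))
        .build();
```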

If you're not spinning up a new datacenter in an existing cluster,
rebuilding a node is not the best way to achieve this without compromising
consistency.
The node should be replaced instead, which will make it bootstrap safely
(it can replace itself, using the
"-Dcassandra.replace_address_first_boot=" flag).
Bootstrap lets the node stream the data it needs faster than repair would,
while keeping it out of read requests.
The procedure is to stop Cassandra, wipe the data, commit log and saved
caches directories, and then restart the node with the JVM flag set in
cassandra-env.sh. The node will appear as joining or down while
bootstrapping (it depends on whether it replaces itself or another node;
I can't remember the specifics).
If it shows up as down, it will rely on hints to get the writes. If it
shows as joining, it will get the writes while streaming is ongoing.
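In cassandra-env.sh the flag looks like the line below (the address value is a placeholder for the node's own IP):

```
# cassandra-env.sh: make the node replace itself on first boot after the wipe
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=<this node's IP>"
```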

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Tue, Aug 6, 2019 at 12:03 PM Cyril Scetbon  wrote:

> Can you elaborate on that ? We use GPFS
> without cassandra-topology.properties.
> —
> Cyril Scetbon
>
> On Aug 5, 2019, at 11:23 PM, Jeff Jirsa  wrote:
>
> some snitch trickery (setting the badness for the rebuilding host) via jmx
>
>
>


Re: Rebuilding a node without clients hitting it

2019-08-06 Thread Cyril Scetbon
We have clients in all our DCs.

Rebuild has always been much faster for us than repair. It operates like
bootstrap, streaming the data for each token range from a single source
replica (you need to run a cleanup if it is run multiple times). Repair is
a different operation and is not supposed to be run on an empty node: it
does more processing, such as Merkle tree comparisons, deleting tombstones,
etc.

We use repairs when we add a DC as a new replication source for a keyspace.
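As a rough sketch, the rebuild flow described above boils down to two commands on the empty node (the source DC name is an example):

```
# Stream the data for each token range from one replica in the source DC
nodetool rebuild -- DC1

# If rebuild was run more than once, drop any out-of-range data afterwards
nodetool cleanup
```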
—
Cyril Scetbon

> Assuming the rebuild is happening on a node in another DC, then there should 
> not be an issue if you are using LOCAL_ONE. If the node is in the local DC 
> (i.e., same DC as the client), I am inclined to think repair would be more 
> appropriate than rebuild but I am not 100% certain.
> -- 
> 
> - John



Re: Rebuilding a node without clients hitting it

2019-08-06 Thread Cyril Scetbon
Can you elaborate on that ? We use GPFS without cassandra-topology.properties.
—
Cyril Scetbon

> On Aug 5, 2019, at 11:23 PM, Jeff Jirsa  wrote:
> 
> some snitch trickery (setting the badness for the rebuilding host) via jmx 



Re: [EXTERNAL] Re: loading big amount of data to Cassandra

2019-08-06 Thread Hiroyuki Yamada
cassandra-loader is also useful because you don't need to create sstables.
https://github.com/brianmhess/cassandra-loader
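A typical invocation is a single command; the file, host, and schema below are invented, so check the project README for the exact flags:

```
cassandra-loader -f data.csv -host 10.0.0.1 \
    -schema "ks.events(id, payload)"
```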

Hiro

On Tue, Aug 6, 2019 at 12:15 AM Durity, Sean R
 wrote:
>
> DataStax has a very fast bulk load tool - dsebulk. Not sure if it is 
> available for open source or not. In my experience so far, I am very 
> impressed with it.
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra