Impact of a large number of components in a partition key / clustering key
Say there are 1 vs. 3 vs. 5 vs. 8 components in a clustering key. Will range slicing slow down as the number of components grows? Will compactions be impacted?
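For intuition on the range-slicing part of the question: rows within a partition are stored sorted by the full clustering tuple, so a slice on a clustering-key prefix is always a contiguous scan, regardless of how many components the key has — the comparator just compares more fields per row. A toy illustration in plain Python (not Cassandra code; the dates here are made-up example values):

```python
import bisect

# Rows in one partition, keyed by a 3-component clustering key.
# Cassandra keeps these sorted by the full clustering tuple.
rows = sorted([
    ("2019", "08", "05"),
    ("2019", "07", "29"),
    ("2019", "08", "01"),
    ("2018", "12", "31"),
])

# A slice on a prefix of the clustering key is a contiguous range:
# two binary searches find all rows whose first component is "2019".
lo = bisect.bisect_left(rows, ("2019",))
hi = bisect.bisect_right(rows, ("2019", chr(0x10FFFF)))
print(rows[lo:hi])
# [('2019', '07', '29'), ('2019', '08', '01'), ('2019', '08', '05')]
```

More components mean slightly wider comparisons and slightly larger cells, but the slice itself stays a single contiguous read.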
Re: Differing snitches in different datacenters
Hello Voytek,

In my opinion, it would be better for you to continue using GossipingPropertyFileSnitch in AWS as well. I would do it to avoid surprises. I've set up datacenters in AWS using GossipingPropertyFileSnitch with zero problems.

Jean Carlo
"The best way to predict the future is to invent it" Alan Kay

On Wed, Jul 31, 2019 at 9:06 PM Voytek Jarnot wrote:
> Thanks Paul. Yes - finding a definitive answer is where I'm failing as well. I think we're probably going to try it and see what happens, but that's a bit worrisome.

On Mon, Jul 29, 2019 at 3:35 PM Paul Chandler wrote:
> Hi Voytek,
> I looked into this a little while ago, and couldn't really find a definitive answer. We ended up keeping the GossipingPropertyFileSnitch in our GCP datacenter; the only downside that I could see is that you have to manually specify the rack and DC. But doing it that way does allow you to create a multi-vendor cluster if you wished in the future.
> I would also be interested if anyone has the definitive answer on this.
> Thanks
> Paul
> www.redshots.com

On 29 Jul 2019, at 17:06, Voytek Jarnot wrote:
> Just a quick bump - hoping someone can shed some light on whether running different snitches in different datacenters is a terrible idea or not. It'd be fairly temporary: once the new DC is stood up and the nodes are rebuilt, the old DC will be decommissioned.

On Thu, Jul 25, 2019 at 12:36 PM Voytek Jarnot wrote:
> Quick and hopefully easy question for the list. Background: an existing cluster (1 DC) will be migrated to an AWS-hosted cluster by standing up a second datacenter; the existing cluster will subsequently be decommissioned.
> We currently use GossipingPropertyFileSnitch and are thinking about using Ec2MultiRegionSnitch in the new AWS DC - that'd position us nicely if in the future we want to run a multi-DC cluster in AWS.
> My question is: are there any issues with one DC using GossipingPropertyFileSnitch and the other using Ec2MultiRegionSnitch? This setup would be temporary, existing until the new DC nodes have rebuilt and the old DC is decommissioned.
> Thanks,
> Voytek Jarnot
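If you follow the advice above and keep GossipingPropertyFileSnitch in AWS, one common convention is to name the dc and rack after the EC2 region and availability zone, so the topology matches what an EC2 snitch would infer if you ever switched later. A sketch of cassandra-rackdc.properties for a node in us-east-1a (the region/AZ names here are placeholders for your own deployment):

```properties
# cassandra-rackdc.properties on an AWS node using
# GossipingPropertyFileSnitch. dc and rack are set manually;
# mirroring the EC2 region/AZ naming keeps the topology consistent
# with the classic Ec2Snitch convention (dc = region, rack = AZ suffix).
dc=us-east
rack=1a
```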
Re: [EXTERNAL] Re: loading big amount of data to Cassandra
With the DataStax bulk loader you can only export from a Cassandra table, not import into Cassandra (it only loads into a DSE cluster). And +1 on the confusing name of batches ... yes, it's for writes, but not for loading data.

Amanda

On Aug 5, 2019, at 8:14 AM, Durity, Sean R wrote:
> DataStax has a very fast bulk load tool - dsbulk. Not sure if it is available for open source or not. In my experience so far, I am very impressed with it.
> Sean Durity - Staff Systems Engineer, Cassandra

On Aug 3, 2019, p...@xvalheru.org wrote:
> Thanks to all, I'll try the SSTables.
> Pat

On 2019-08-03 09:54, Dimo Velev wrote:
> Check out the CQLSSTableWriter java class - https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java. You use it to generate sstables - you need to write a small program for that. You can then stream them over the network using the sstableloader (either use the utility or use the underlying classes to embed it in your program).

On 3. Aug 2019, at 07:17, Ayub M wrote:
> Dimo, how do you generate sstables? Do you mean load data locally on a cassandra node and use sstableloader?

On Fri, Aug 2, 2019, 5:48 PM Dimo Velev wrote:
> Hi,
> Batches will actually slow down the process because they mean a different thing in C* - as you read, they just group changes together that you want executed atomically.
> Cassandra does not really have indices, so that is different from a relational DB. However, after writing stuff, Cassandra generates many smallish sstables of the data. These are then compacted together in the background to improve read performance.
> You have two options from my experience:
> Option 1: use the normal CQL API in async mode. This will create a high CPU load on your cluster. Depending on whether that is fine for you, it might be the easiest solution.
> Option 2: generate sstables locally and use the sstableloader to upload them into the cluster. The streaming does not generate high CPU load, so it is a viable option for clusters with other operational load.
> Option 2 scales with the number of cores of the machine generating the sstables. If you can split your data, you can generate sstables on multiple machines. In contrast, option 1 scales with your cluster. If you have a large cluster that is idling, it would be better to use option 1.
> With both options I was able to write at about 50-100K rows/sec on my laptop and a local Cassandra. The speed heavily depends on the size of your rows.
> Back to your question - I guess option 2 is similar to what you are used to from tools like SQL*Loader for relational DBMSes.
> I had a requirement of loading a few hundred million rows per day into an operational cluster, so I went with option 2 to offload the CPU work and reduce the impact on the reading side during the loads.
> Cheers,
> Dimo

On 2. Aug 2019, at 18:59, p...@xvalheru.org wrote:
> Hi,
> I need to upload about 7 billion records to Cassandra. What is the best setup of Cassandra for this task? Will usage of batches speed up the upload (I've read somewhere that batch in Cassandra is dedicated to atomicity, not to speeding up communication)? How does Cassandra work internally with regard to indexing? In SQL databases, when uploading such an amount of data, it is suggested to turn indexing off and then back on. Is something similar possible in Cassandra?
> Thanks for all suggestions.
> Pat
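For option 1, the key detail is bounding the number of in-flight async writes so the cluster is not flooded. A minimal sketch of that pattern in Python: here `write_row` is a hypothetical stub standing in for a real driver call such as `session.execute_async(...)` from the Python cassandra-driver, so the backpressure logic can be shown without a live cluster.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def write_row(row):
    # Stand-in for a real async driver call; it just echoes the row.
    return row

def bulk_load(rows, max_in_flight=128):
    """Issue writes concurrently while capping how many are in flight."""
    written = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        in_flight = set()
        for row in rows:
            in_flight.add(pool.submit(write_row, row))
            if len(in_flight) >= max_in_flight:
                # Wait for at least one write to finish before
                # submitting more - this is the backpressure step.
                done = next(as_completed(in_flight))
                in_flight.remove(done)
                written += 1
        # Drain the remaining in-flight writes.
        for f in as_completed(in_flight):
            f.result()
            written += 1
    return written

print(bulk_load({"id": i} for i in range(1000)))  # 1000
```

With the real driver you would also collect failed futures and retry them; this sketch only shows the concurrency cap.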
Re: Rebuilding a node without clients hitting it
Hi Cyril,

It will depend on the load balancing policy used in the client code. If you're only accessing DC1, with the node being rebuilt living in DC2, then you need your clients to be using the DCAwareRoundRobinPolicy to restrict connections to DC1 and avoid any queries hitting DC2. If clients are accessing both datacenters and you're not using the TokenAwarePolicy, then even with LOCAL_ONE the coordinator could pick the node being rebuilt to process the query.

If you're not spinning up a new datacenter in an existing cluster, rebuilding a node is not the best way to achieve this without compromising consistency. The node should be replaced instead, which will make it bootstrap safely (it can replace itself, using the "-Dcassandra.replace_address_first_boot=<ip_address>" flag). Bootstrap lets the node stream the data it needs faster than repair would, while keeping it out of read requests.

The procedure is to stop Cassandra, wipe the data, commit log and saved caches directories, and then restart the node with the JVM flag set in cassandra-env.sh. The node will appear as joining or down while bootstrapping (it depends on whether it replaces itself or another node, I can't remember the specifics). If it shows up as down, it will rely on hints to get the writes. If it shows as joining, it will get the writes while streaming is ongoing.

Cheers,

Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

On Tue, Aug 6, 2019 at 12:03 PM Cyril Scetbon wrote:
> Can you elaborate on that? We use GPFS without cassandra-topology.properties.
> — Cyril Scetbon
>
> On Aug 5, 2019, at 11:23 PM, Jeff Jirsa wrote:
> > some snitch trickery (setting the badness for the rebuilding host) via jmx
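A toy model of the DCAwareRoundRobinPolicy behavior described above (an illustration only, not the driver's actual implementation): the query plan contains only hosts from the configured local DC, so a node being rebuilt in the remote DC is never chosen as coordinator.

```python
def dc_aware_plan(hosts, local_dc, start=0):
    """Yield coordinator candidates restricted to the local DC,
    round-robin from a rotating start position (toy model of the
    driver's DCAwareRoundRobinPolicy)."""
    local = [h for h, dc in hosts if dc == local_dc]
    for i in range(len(local)):
        yield local[(start + i) % len(local)]

hosts = [("10.0.0.1", "DC1"), ("10.0.0.2", "DC1"),
         ("10.0.1.1", "DC2"),  # e.g. the node being rebuilt in DC2
         ("10.0.1.2", "DC2")]

plan = list(dc_aware_plan(hosts, "DC1", start=1))
print(plan)  # ['10.0.0.2', '10.0.0.1'] - DC2 hosts never appear
```

The real policy also rotates the start position per query and can optionally append remote hosts as a fallback; restricting it as above is what keeps the rebuilding node out of the plan.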
Re: Rebuilding a node without clients hitting it
We have clients in all our DCs. Rebuild has always been much faster for us than repair. It operates like bootstrap, streaming data from only one source replica for each token range (you need to run a cleanup if it is run multiple times). Repair is a different operation and is not supposed to be run on an empty node: it does more processing - Merkle tree comparisons, deleting tombstones, etc. We use repair when we add a DC as a new replication source for a keyspace.

— Cyril Scetbon

> Assuming the rebuild is happening on a node in another DC, then there should not be an issue if you are using LOCAL_ONE. If the node is in the local DC (i.e., same DC as the client), I am inclined to think repair would be more appropriate than rebuild, but I am not 100% certain.
> - John
Re: Rebuilding a node without clients hitting it
Can you elaborate on that? We use GPFS without cassandra-topology.properties.

— Cyril Scetbon

> On Aug 5, 2019, at 11:23 PM, Jeff Jirsa wrote:
> > some snitch trickery (setting the badness for the rebuilding host) via jmx
Re: [EXTERNAL] Re: loading big amount of data to Cassandra
cassandra-loader is also useful because you don't need to create sstables. https://github.com/brianmhess/cassandra-loader

Hiro

On Tue, Aug 6, 2019 at 12:15 AM Durity, Sean R wrote:
> DataStax has a very fast bulk load tool - dsbulk. Not sure if it is available for open source or not. In my experience so far, I am very impressed with it.
> Sean Durity - Staff Systems Engineer, Cassandra