Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Goutham reddy
Thanks Jonathan, I believe we have to reconsider how our analytics are performed. On Fri, Jan 4, 2019 at 1:46 PM Jonathan Haddad wrote: > If you absolutely have to use Cassandra as the source of your data, I > agree with Dor. > > That being said, if you're going to be doing a lot of

Re: Cassandra Splitting databases

2019-01-04 Thread Dor Laor
Not sure I understand correctly, but if you have one cluster with 2 separate datacenters, you can define keyspace A to live only in the on-premise DC and keyspace B only on Azure. On Fri, Jan 4, 2019 at 2:23 PM R1 J1 wrote: > We currently have 2 databases (A and B ) on a 6 node cluster. > 3
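A minimal sketch of the keyspace-per-datacenter setup Dor describes, using the DataStax Python driver; the contact point, keyspace names, DC names ('onprem', 'azure'), and replication factors are hypothetical, and the DC names must match what your snitch reports:

    # Pin each keyspace's replicas to a single datacenter with
    # NetworkTopologyStrategy. Names and contact point are hypothetical.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect()

    # Keyspace A: replicas only in the on-premise datacenter.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS keyspace_a
        WITH replication = {'class': 'NetworkTopologyStrategy', 'onprem': 3}
    """)

    # Keyspace B: replicas only in the Azure datacenter.
    session.execute("""
        CREATE KEYSPACE IF NOT EXISTS keyspace_b
        WITH replication = {'class': 'NetworkTopologyStrategy', 'azure': 3}
    """)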

Re: Cassandra Splitting databases

2019-01-04 Thread Jeff Jirsa
I encourage you to try all of these in a lab/non-prod environment before you do this in production. And take backups. This is risky and you should think about what you're doing before you do it. The most practical way to do this with no downtime is to spin up a new cluster in Azure and either do

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread DuyHai Doan
"The problem is I can't know the combination of set/unset values" --> Just for this requirement, Achilles has a working solution for many years using INSERT_NOT_NULL_FIELDS strategy: https://github.com/doanduyhai/Achilles/wiki/Insert-Strategy Or you can use the Update API that by design only

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread Jonathan Haddad
If you're overwriting values, it really doesn't matter much whether it's a tombstone or any other value; either way it still needs to be compacted and has the same overhead at read time. Tombstones are problematic when you try to use Cassandra as a queue (or something like a queue) and you need to scan over

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread Tomas Bartalos
Hello Jon, I thought having tombstones carries much higher overhead than just overwriting values. The compaction overhead can be similar, but I think the read performance is much worse. Tombstones accumulate and hang around for 10 days (by default) before they are eligible for compaction. Also we have
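The 10-day window Tomas mentions is the table's gc_grace_seconds (864000 seconds by default). A hedged sketch of lowering it; the keyspace and table names are hypothetical, and shrinking the window is only safe if repair completes within it:

    # Tombstones become purgeable only after gc_grace_seconds (default
    # 864000 s = 10 days). Lowering it shortens that window, but repair
    # must complete within the new window or deleted data can reappear.
    # Keyspace/table names are hypothetical.
    from cassandra.cluster import Cluster

    cluster = Cluster(["10.0.0.1"])
    session = cluster.connect("my_keyspace")

    session.execute("ALTER TABLE person WITH gc_grace_seconds = 86400")  # 1 day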

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread Tomas Bartalos
Hello, I believe your approach is the same as using Spark with "spark.cassandra.output.ignoreNulls=true". This will not cover the situation when a value has to be overwritten with null. I found one possible solution - change the schema to keep only primary key fields and move all other fields to
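For reference, a sketch of the Spark setting Tomas refers to, written with the Spark Cassandra Connector's DataFrame API; the host, keyspace, and table names are hypothetical, and the spark-cassandra-connector package must be on the classpath:

    # Write a DataFrame with ignoreNulls enabled: null columns are simply
    # skipped instead of being written as tombstones. Names hypothetical.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("ignore-nulls-write")
             .config("spark.cassandra.connection.host", "10.0.0.1")
             .config("spark.cassandra.output.ignoreNulls", "true")
             .getOrCreate())

    df = spark.createDataFrame([(42, "Jane", None)],
                               ["id", "firstname", "lastname"])

    (df.write
       .format("org.apache.spark.sql.cassandra")
       .options(keyspace="my_keyspace", table="person")
       .mode("append")
       .save())

As Tomas notes, with ignoreNulls the null simply never reaches Cassandra, so it cannot be used to deliberately erase an existing value.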

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread Jonathan Haddad
Those are two different cases though. It *sounds like* (again, I may be missing the point) you're trying to overwrite a value with another value. You're either going to serialize a blob and overwrite a single cell, or you're going to overwrite all the cells and include a tombstone. When you do a

Re: [EXTERNAL] Howto avoid tombstones when inserting NULL values

2019-01-04 Thread DuyHai Doan
The idea of storing your data as a single blob can be dangerous. Indeed, you lose the ability to perform atomic updates on each column. In Cassandra, LWW is the rule. Suppose 2 concurrent updates on the same row: the 1st update changes column Firstname (let's say it's a Person record) and the 2nd update
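A toy model of the clobbering DuyHai describes, not Cassandra's actual implementation; column names and timestamps are invented for illustration:

    # Toy last-write-wins (LWW) resolution. With one cell per column,
    # concurrent updates to different columns both survive; with a single
    # blob cell, the later write silently discards the earlier one.

    def lww_merge(write_a, write_b):
        """Merge two writes cell by cell, keeping the highest timestamp."""
        merged = dict(write_a)
        for col, (ts, val) in write_b.items():
            if col not in merged or ts > merged[col][0]:
                merged[col] = (ts, val)
        return merged

    # Per-column schema: update 1 changes Firstname, update 2 Lastname.
    u1 = {"firstname": (100, "Jane")}
    u2 = {"lastname": (101, "Doe")}
    print(lww_merge(u1, u2))  # both changes survive

    # Blob schema: both updates read {John, Smith}, rewrite the whole blob.
    b1 = {"blob": (100, {"firstname": "Jane", "lastname": "Smith"})}
    b2 = {"blob": (101, {"firstname": "John", "lastname": "Doe"})}
    print(lww_merge(b1, b2))  # update 1's Firstname change is lost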

Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Goutham reddy
Hi, We have a heavy data-lifting and analytics requirement and decided to go with Apache Spark. In the process we have come up with two patterns: a. Apache Spark and Apache Cassandra co-located and sharing the same nodes. b. Apache Spark on one independent cluster and Apache Cassandra as

Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Goutham reddy
Thank you very much Dor for the detailed information; yes, that should be the primary reason why we have to isolate Spark from Cassandra. Thanks and Regards, Goutham Reddy On Fri, Jan 4, 2019 at 1:29 PM Dor Laor wrote: > I strongly recommend option B, separate clusters. Reasons: > - Networking of

Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Dor Laor
I strongly recommend option B, separate clusters. Reasons: - Node-to-node networking is negligible compared to networking within the node - Different scaling considerations: your workload may require 10 Spark nodes and 20 database nodes, so why bundle them? This ratio may also change over

Re: Good way of configuring Apache spark with Apache Cassandra

2019-01-04 Thread Jonathan Haddad
If you absolutely have to use Cassandra as the source of your data, I agree with Dor. That being said, if you're going to be doing a lot of analytics, I recommend using something other than Cassandra with Spark. The performance isn't particularly wonderful and you'll likely get anywhere from

Cassandra Splitting databases

2019-01-04 Thread R1 J1
We currently have 2 databases (A and B) on a 6-node cluster. 3 nodes are on premise and 3 in Azure. I want database A to live on the on-premise cluster and database B to stay in Azure. I want to then split the cluster into 2 clusters: one on-premise (3 nodes) having database A and