Potential bug related to MVs after upgrading from 3.11.x to 4.0.x

2023-05-31 Thread Rahul Singh
Good afternoon,

Am looking into some latent issues post 3.11.x-to-4.0.x upgrade. The system
is using materialized views, and the core problem is related to how the
mutations are being sent from the parent table to two related materialized
views.

In 3.11.x, without any tuning (no flag set for
-Dcassandra.mv_enable_coordinator_batchlog, no changes to the concurrent
MV writers, etc.), the cluster behaved fine if a node went down. After the
upgrade, there are tons of CL LOCAL_ONE issues related to acquiring a lock
on every other node that was up, and eventually the CPU, network, and
memory on every other node get saturated until the downed node is brought
back up.

I've compared the cassandra.yaml, jvm.options, etc. and don't see anything
especially different.

I see two major code paths that use ViewManager.updatesAffectView to
determine next steps: one in Keyspace.java and one in StorageProxy.java.
From my review, the StorageProxy path is not being hit.

I've tracked the code down to:
https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/db/Keyspace.java#L528

and:
https://github.com/apache/cassandra/blob/43ec1843918aba9e81d3c2dc1433a1ef4740a51f/src/java/org/apache/cassandra/db/view/ViewManager.java#L71

```
if (!enableCoordinatorBatchlog && coordinatorBatchlog)
return false;
```


We tried setting the MV coordinator flag to true and it made no difference,
which shouldn't be the case. If the value isn't set, it should default to
false per Java's Boolean.getBoolean method. May try setting it explicitly
to false and observing the behavior.
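
For reference, Boolean.getBoolean returns true only when the named system
property exists and equals "true" (case-insensitive); anything else,
including an unset property, yields false. A minimal standalone sketch of
that behavior (the wrapper class here is mine, not Cassandra's):

```
public class BatchlogFlagCheck
{
    // Same pattern the Cassandra code uses to read the flag: an absent
    // or non-"true" property value produces false.
    private static final boolean enableCoordinatorBatchlog =
        Boolean.getBoolean("cassandra.mv_enable_coordinator_batchlog");

    public static void main(String[] args)
    {
        // Prints "false" unless the JVM was started with
        // -Dcassandra.mv_enable_coordinator_batchlog=true
        System.out.println(enableCoordinatorBatchlog);
    }
}
```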

The strategic recommendation I've made is to move away from MVs to
self-managed views and then eventually make use of SAI if it works. I'm
still curious why it would behave so drastically differently in 4.0.x than
3.11.x.

Has anyone else seen something like this? Am also going to try to recreate
this in a vanilla environment and will report back.

rahul.xavier.si...@gmail.com

http://cassandra.link


Re: Ansible Cassandra Collection

2021-03-19 Thread Rahul Singh
This is great! I've added the repo to
https://github.com/Anant/awesome-cassandra and cassandra.link, and if it's
not already there I'll get it added to cassandra.tools.

We'd love to hear your thoughts on this in the Cassandra Kubernetes SIG
because we are generally tackling the same / similar issues.


Best regards,
Rahul Singh
rahul.xavier.si...@gmail.com

http://cassandra.link



On Fri, Mar 19, 2021 at 9:27 AM Erick Ramirez 
wrote:

> Fantastic, Rhys! Thanks very much. I'm sure it will prove very useful for
> users in the community. Cheers!
>
>>


Re: data modeling qu: use a Map datatype, or just simple rows... ?

2020-09-19 Thread Rahul Singh
Not necessarily. A deterministic hash randomizes a key that may be susceptible 
to “clustering”, and that key may also need to be used in other, non-Cassandra 
systems.

This way records can be accessed in both systems while leveraging the 
partitioner in Cassandra without its pitfalls.

The same can be done with natural string keys like “email.”
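
As a concrete sketch of the idea (the class name, bucket count, and MD5
choice are mine, purely illustrative):

```
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public final class UserIdHasher
{
    private static final int BUCKETS = 1024; // hypothetical bucket count

    // Deterministic: the same userId always lands in the same bucket,
    // whether computed next to Cassandra or in any other system.
    public static int bucketFor(String userId)
    {
        try
        {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] d = md5.digest(userId.getBytes(StandardCharsets.UTF_8));
            int hash = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                     | ((d[2] & 0xFF) << 8) | (d[3] & 0xFF);
            return Math.floorMod(hash, BUCKETS);
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new AssertionError(e); // MD5 is available on every JVM
        }
    }
}
```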

Best regards,
Rahul Singh
From: Sagar Jambhulkar 
Sent: Saturday, September 19, 2020 6:45:25 AM
To: user@cassandra.apache.org ; Attila Wind 

Subject: Re: data modeling qu: use a Map datatype, or just simple rows... ?

Don't really see a difference in the two options. Won't the partitioner run on user 
id and create a hash for you? Unless your hash function is better than the 
partitioner's.

On Fri, 18 Sep 2020, 21:33 Attila Wind,  wrote:
> Hey guys,
> I'm curious about your experiences regarding a data modeling question we are 
> facing with.
> At the moment we see 2 major different approaches in terms of how to build 
> the tables
> But I'm googling around already for days with no luck to find any useful 
> material explaining to me how a Map (as a collection datatype) works on the 
> storage engine, and what could surprise us later if we use it. So we decided to ask 
> this question... (If someone has some nice pointers here, maybe that is also 
> much appreciated!)
> So
> To describe the problem in a simplified form
>  • Imagine you have users (everyone is identified with a UUID),
>  • and we want to answer a simple question: "have we seen this guy before?"
>  • we "just" want to be able to answer this question for a limited time - 
> let's say for 3 months
>  • but... there are lots of lots of users we run into... many millions / 
> each day...
>  • and ~15-20% of them are returning users only - so many guys we just 
> might see once
> We are thinking about something like a big big Map, in a form of
>     userId => lastSeenTimestamp
>
> Obviously if we would have something like that then answering the above 
> question is simply:
>     if(map.get(userId) != null)  => TRUE - we have seen the guy before
> Regarding the 2 major modelling approaches I mentioned above
>
> Approach 1
> Just simply use a table, something like this
>
> CREATE TABLE IF NOT EXISTS users (
>     user_id            varchar,
>     last_seen        int,                -- a UNIX timestamp is enough, thats 
> why int
>     PRIMARY KEY (user_id)
> ) WITH default_time_to_live = <3 months of seconds>;
> Approach 2
>  to avoid producing that many rows, "cluster" the guys a bit together (into 1 
> row), so
> introduce a hashing function over the userId, producing a value btw [0; 1]
> and go with a table like
> CREATE TABLE IF NOT EXISTS users (
>     user_id_hash    int,
>     users_seen        map<varchar, int>,            -- this is a userId => last 
> timestamp map
>     PRIMARY KEY (user_id_hash)
> ) WITH default_time_to_live = <3 months of seconds>;        -- yes, its clearly 
> not a good enough way ...
>
> In theory:
>  • on the WRITE path both representations give us a way to do the write without 
> the need of a read
>  • even the READ path is pretty efficient in both cases
>  • Approach 2 is definitely worse when we come to the cleanup - "remove info 
> if older than 3 months"
>  • Approach 2 might affect the balance of the cluster more - that's clear 
> (however not that much, due to the "law of large numbers" and enough truly 
> random factors)
> And what we are struggling with is: what do you think?
> Which approach would be better over time? Which will slow down the cluster less, 
> considering compaction etc. etc.?
> As far as we can see the real question is:
> which hurts more?
>  • many more rows, but very small rows (regarding data size), or
>  • far fewer rows, but much bigger rows (regarding data size)
> ?
> Any thoughts, comments, pointers to some related case studies, articles, etc 
> is highly appreciated!! :-)
> thanks!
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932
>
>


Cassandra.Link Knowledge Base - v. 0.7 - Jobs Section

2020-08-05 Thread Rahul Singh
Folks,

Quick update. We added a jobs section that's aggregating jobs from a few
different job markets, but only those that relate to Cassandra. Right now
it's just a Lucene-based filter over a larger data set, but our team is
working to put some ML action to do NLP-based classification.

Would love to get your feedback. It's not Indeed, but it is interesting to
see how people are using Cassandra out there, and who.

Best,


rahul.xavier.si...@gmail.com

http://cassandra.link


Open Request for Reference Architectures / Case Studies of Apache Cassandra

2020-07-31 Thread Rahul Singh
Folks,

I'm looking for articles, blogs, diagrams, or someone to answer some
questions so that I can help repopulate a case study database of uses of
Cassandra.
I'll be publishing these on the https://cassandra.link site. I'm also
collecting reference architectures so we can improve the Cassandra docs and
so that I can create some targets for the Cassandra Operator for Kubernetes
SIG.

The old Planet Cassandra site had 226 or so case studies. I've mined most
of the data from the Wayback Machine and have been contacting people on
LinkedIn, but if you are one of those people, or know of a project that is
currently using Cassandra that could be written up, potentially with an
architecture diagram of how it's built, I'd love to collaborate.

rahul.xavier.si...@gmail.com

http://cassandra.link


Cassandra.Link Knowledge Base - v. 0.5

2020-03-03 Thread Rahul Singh
Our apprentice / analyst team (Tanaka, Jordon, and Cynthia) at Anant.us have 
been making improvements to the best public collection of curated links for 
Apache Cassandra at https://Cassandra.Link — they’ve fixed several issues 
related to the search interface.

This week we’ll also be releasing “Cassandra.Tools”, a list of open and 
commercial tools that have been tested by us with different variants of Apache 
Cassandra and other software that is data protocol / management protocol 
compliant with Cassandra.

Our goal with these and future sites is to add to the ecosystem and make it 
easier for Cassandra users to learn, master, and propagate knowledge around the 
Apache Cassandra “movement”.

We’re always in “beta” so if you have any feedback feel free to contact me here 
on a PM or reach out to the team through the site.  More specifically if you 
have time, I’d love to know what your challenges are in learning Cassandra. 
The feedback will also be used to help with the Documentation project for 
Apache Cassandra.

rahul.xavier.si...@gmail.com

http://cassandra.link
The Apache Cassandra Knowledge Base.


Cassandra.Link Knowledge Base - v. 0.4

2019-07-20 Thread Rahul Singh
Hey Cassandra community ,

Thanks for all the feedback in the past on my cassandra knowledge base project. 
Without the feedback cycle it’s not really for the community.

V. 0.1 - Awesome Cassandra README.md
https://anant.github.io/awesome-cassandra

Hundreds of Cassandra articles tools etc. organized in a table of contents. 
This currently still maintains the official Cassandra.Link redirection.

V. 0.2 - 600+ Organized links in Wallabag exposed as a web interface
http://leaves.anant.us/leaves/#!/?tag=cassandra

V. 0.3 - those 600+ indexed with some natural language / entity extraction jazz 
to machine-taxonomize them


V. 0.4 - 10-15 blog feeds + organized links as a statically generated site. 
(Search not working yet, but it will be.)

https://cassandra.netlify.com/

This last version was made possible by a few of our team members in Washington 
DC and in Delhi : Mohd Danish (Delhi) , Tanaka Mapondera (DC), Rishi Nair 
(DC/VA) and the numerous contributions of the Cassandra community in the form 
of tools, articles, and videos.

We’d love to get folks to check it out and send critical feedback — directly to 
me or do a pull request on the awesome-cassandra repo (this is just for the 
readme with Cassandra knowledge organized in a Table of Contents.)

Best,

rahul.xavier.si...@gmail.com http://cassandra.link


Re: CassKop : a Cassandra operator for Kubernetes developped by Orange

2019-05-24 Thread Rahul Singh
Fantastic! Now there are three teams making k8s operators for C*: Datastax,
Instaclustr, and now Orange.

rahul.xavier.si...@gmail.com

http://cassandra.link

I'm speaking at #DataStaxAccelerate, the world’s premiere #ApacheCassandra
conference, and I want to see you there! Use my code Singh50 for 50% off
your registration. www.datastax.com/accelerate


On Fri, May 24, 2019 at 9:07 AM Jean-Armel Luce  wrote:

> Hi folks,
>
> We are excited to announce that CassKop, a Cassandra operator for
> Kubernetes developped by Orange teams, is now ready for Beta testing.
>
> CassKop works as a usual K8S controller (reconcile the real state with a
> desired state) and automates the Cassandra operations through JMX. All the
> operations are launched by calling standard K8S APIs (kubectl apply …) or
> by using a K8S plugin (kubectl casskop …).
>
> CassKop is developed in GO, based on CoreOS operator-sdk framework.
> Main features already available :
> - deploying a rack aware cluster (or AZ aware cluster)
> - scaling up & down (including cleanups)
> - setting and modifying configuration parameters (C* and JVM parameters)
> - adding / removing a datacenter in Cassandra (all datacenters must be in
> the same region)
> - rebuilding nodes
> - removing node or replacing node (in case of hardware failure)
> - upgrading C* or Java versions (including upgradesstables)
> - monitoring (using Prometheus/Grafana)
> - ...
>
> By using local and persistent volumes, it is possible to handle failures
> or stop/start nodes for maintenance operations with no transfer of data
> between nodes.
> Moreover, we can deploy cassandra-reaper in K8S and use it for scheduling
> repair sessions.
> For now, we can deploy a C* cluster only as a mono-region cluster. We will
> work during the next weeks to be able to deploy a C* cluster as a multi
> regions cluster.
>
> Still in the roadmap :
> - Network encryption
> - Monitoring (exporting logs and metrics)
> - backup & restore
> - multi-regions support
>
> We'd be interested to hear you try this and let us know what you think!
>
> Please read the description and installation instructions on
> https://github.com/Orange-OpenSource/cassandra-k8s-operator.
> For a quick start, you can also follow this step by step guide :
> https://orange-opensource.github.io/cassandra-k8s-operator/index.html?slides=Slides-CassKop-demo.md#1
>
>
> The CassKop Team
>


Re: Cassandra cross dc replication row isolation

2019-05-07 Thread Rahul Singh
Depends on the consistency levels you are setting on writes and reads.

What CL are you writing at, and what CL are you reading at?

The consistency level tells the coordinator when to send acknowledgement of a 
write and whether to cross DCs to confirm a write. It also tells the 
coordinator how many replicas to read and whether or not to cross DCs to get 
consensus.

E.g. LOCAL_QUORUM is different from QUORUM.
LOCAL_QUORUM guarantees data was saved to a quorum of replicas in the DC in which 
the coordinator accepted the write. Similarly, it would only check nodes in that 
DC. QUORUM would check across DCs in the whole cluster.
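
To make that concrete, a minimal sketch with the DataStax Java driver 3.x
(the contact point, keyspace, and table names are hypothetical):

```
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ConsistencyLevels
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace"))
        {
            // LOCAL_QUORUM: acknowledged once a quorum of replicas in the
            // coordinator's local DC has the write; remote DCs are not waited on.
            SimpleStatement write = new SimpleStatement(
                "INSERT INTO my_table (pk, col) VALUES (?, ?)", 1, "value");
            write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(write);

            // QUORUM: needs a quorum of replicas across the whole cluster,
            // so it may have to cross DCs on both writes and reads.
            SimpleStatement read = new SimpleStatement(
                "SELECT col FROM my_table WHERE pk = ?", 1);
            read.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(read);
        }
    }
}
```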
On May 7, 2019, 12:11 PM -0500, Alexey Knyshev , 
wrote:
> Hi there!
>
> Could someone please explain how Column Family would be replicated and 
> "visible / readable" in the following scenario? Having multiple 
> geo-distributed datacenters with significant latency (up to 100ms RTT). Let's 
> name two of them A and B and consider the following 2 cases:
>
> 1. Cassandra client X inserts row into Column Family (CF) with Primary Key = 
> PK (all cells are set - no nulls possible). Write coordinator is in dc A. All 
> cells in this write should have the same writetime. For simplicity let's 
> assume that Cassandra coordinator node sets writetime. After some amount of 
> time (< RTT) client Y reads whole row (select * ...) from the same CF with 
> same PK talking to coordinator node from, another dc (B). Is it possible that 
> client Y will get some cells as NULLs, I mean, is it possible to read some 
> already replicated cells and for others get NULLs, or does Cassandra 
> guarantee row-level isolation / atomic write for that insert? Assume that the row 
> (all cells for the same PK) will never be updated / deleted afterwards.
> 2. Same as in p.1, but after the first write at PK the same client (X) updates some 
> columns for the same PK. Will this update be isolated / atomically written 
> and eventually visible in another DC? Will the client see an isolated state, as it 
> was either before the write or after it?
>
> Thanks in advance!
>
>
> --
> linkedin.com/profile
>
> github.com/alexeyknyshev
> bitbucket.org/alexeyknyshev


Re: Five Questions for Cassandra Users

2019-04-01 Thread Rahul Singh
Answers inline.


1.   Do the same people where you work operate the cluster and write
the code to develop the application?


No but the operators need to know development , data-modeling, and
generally how to "code" the application. (Coding is a low-level task of
assigning a code to a concept.. so I don't think that's the proper verb in
these scenarios.. engineering, or software development, or even programming
is a better term). It's because the developers are hired dime a dozen at
the B / C level and then replaced by D /E / F level developers as things go
on.. so the Data team eventually ends up being the expert of the
application and the data platform, and a "Center of Excellence" for the
development / architects to work with on a collaborative basis.



2.   Do you have a metrics stack that allows you to see graphs of
various metrics with all the nodes displayed together?



Yes. OpsCenter, ELK, Grafana, custom node data visualizers in excel
(because lines and charts don't tell you everything)


3.   Do you have a log stack that allows you to see the logs for all
the nodes together?

ELK. CloudWatch


4.   Do you regularly repair your clusters - such as by using Reaper?

 Depends. Cron, Reaper, OpsCenter Repair, and now NodeSync


5.   Do you use artificial intelligence to help manage your clusters?


Yes, I actually have made an artificial general intelligence called
Gravitron. It learns by ingesting all the news articles I aggregate about
Cassandra and the links I curate on cassandra.link into a solr/lucene index
and then using clustering find out the most popular and popularly connected
content. Once it does that there's a summarization of the content into
human readable content as well as interpreted bash code that gets pushed
into a "Recipe Book." As the master operator identifies scenarios using
english language, and then runs the bash commands, the machine slowly but
surely "wakes up" and starts to manage itself. It can also play Go , the
game, and beat IBM's AlphaGo at Go, and Donald Trump at golf while he was
cheating!



rahul.xavier.si...@gmail.com

http://cassandra.link

I'm speaking at #DataStaxAccelerate, the world’s premiere #ApacheCassandra
conference, and I want to see you there! Use my code Singh50 for 50% off
your registration. www.datastax.com/accelerate






Happy april fools day.





On Thu, Mar 28, 2019 at 5:03 AM Kenneth Brotman
 wrote:

> I’m looking to get a better feel for how people use Cassandra in
> practice.  I thought others would benefit as well so may I ask you the
> following five questions:
>
>
>
> 1.   Do the same people where you work operate the cluster and write
> the code to develop the application?
>
>
>
> 2.   Do you have a metrics stack that allows you to see graphs of
> various metrics with all the nodes displayed together?
>
>
>
> 3.   Do you have a log stack that allows you to see the logs for all
> the nodes together?
>
>
>
> 4.   Do you regularly repair your clusters - such as by using Reaper?
>
>
>
> 5.   Do you use artificial intelligence to help manage your clusters?
>
>
>
>
>
> Thank you for taking your time to share this information!
>
>
>
> Kenneth Brotman
>


Re: How do u setup networking for Opening Solr Web Interface when on cloud?

2019-04-01 Thread Rahul Singh
This is probably not a question for this community... but rather for
Datastax support or the Datastax Academy slack group. More specifically
this is a "how to expose solr securely" question which is amply answered
well on the interwebs if you look for it on Google.


rahul.xavier.si...@gmail.com

http://cassandra.link

I'm speaking at #DataStaxAccelerate, the world’s premiere #ApacheCassandra
conference, and I want to see you there! Use my code Singh50 for 50% off
your registration. www.datastax.com/accelerate


On Mon, Apr 1, 2019 at 12:19 PM Krish Donald  wrote:

> Hi,
>
> We have DSE cassandra cluster running on AWS.
> Now we have requirement to enable Solr and Spark on the cluster.
> We have cassandra on private data subnet which has connectivity to app
> layer.
> From cassandra , we cant open direct Solr Web interface.
> We tried using SSH tunneling and it is working but we cant give SSH
> tunneling option to developers.
>
> We would like to create a Load Balancer  and put the cassandra nodes under
> that load balancer but the question here is , what health check i need to
> give for load balancer so that it can open the Solr Web UI ?
>
> My solution might not be perfect, please suggest any other solution if you
> have ?
>
> Thanks
>
>


Re: TWCS Compactions & Tombstones

2019-03-26 Thread Rahul Singh
What's your timewindow? Roughly how much data is in each window?

If you examine the sstable data and see that is truly old data with little
chance that it has any new data, you can just remove the SStables. You can
do a rolling restart -- take down a node, remove mc-254400-* and then start
it up.


rahul.xavier.si...@gmail.com

http://cassandra.link



On Tue, Mar 26, 2019 at 8:01 AM Nick Hatfield 
wrote:

> How does one properly rid of sstables that have fallen victim to
> overlapping timestamps? I realized that we had TWCS set in our CF which
> also had a read_repair = 0.1 and after correcting this to 0.0 I can clearly
> see the affects over time on the new sstables. However, I still have old
> sstables that date back some time last year, and I need to remove them:
>
> Max: 09/05/2018 Min: 09/04/2018 Estimated droppable tombstones:
> 0.8832057909932046    13G Mar 26 11:34 mc-254400-big-Data.db
>
>
> What is the best way to do this? This is on a production system so any
> help would be greatly appreciated.
>
> Thanks,
>


Re: Merging two cluster's in to one without any downtime

2019-03-26 Thread Rahul Singh
In my experience,

I'd use two methods to make sure that you are covering your ass.
1. "old school" methodology would be to do the SStable load from old to new
cluster -- if you do incremental snapshots, then you could technically
minimize downtime and just load the latest increments with a little
downtime. This is your fall back.
2. "new school" methodology would be to do all your insert/updates through
event sourcing, in which you use CQRS (command query responsibility
segregation), which turns all updates into a series of commands, processed
by a processor. If you have this architecture already, this means you have
a durable message queue from which you can either a) replay all the
mutations or b) do old school method 1 from above to get the bulk of the
data, and then c) use a simultaneous processor for writing data to both old
and new clusters.

Triggers can work, but they're super kludgy in C* 2. Also, you don't have CDC
in C* 2. Event sourcing + CQRS is the _literal_ best approach. Period. You
can do a true blue / green test on both clusters (old and new) to see if
your shit is consistent.

Pardon the language, but you get the message.
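
A bare-bones sketch of the dual-write processor from method 2 (the seed
addresses and the Java driver 3.x API here are illustrative assumptions):

```
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

// Replays each mutation command against both clusters, so the new cluster
// stays in sync while the bulk of the history is back-filled separately.
public class DualWriteProcessor implements AutoCloseable
{
    private final Cluster oldCluster = Cluster.builder().addContactPoint("old-seed").build();
    private final Cluster newCluster = Cluster.builder().addContactPoint("new-seed").build();
    private final Session oldSession = oldCluster.connect();
    private final Session newSession = newCluster.connect();

    public void apply(String cql, Object... values)
    {
        oldSession.execute(cql, values); // existing cluster stays authoritative
        newSession.execute(cql, values); // new cluster receives the same command
    }

    @Override
    public void close()
    {
        oldCluster.close();
        newCluster.close();
    }
}
```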


rahul.xavier.si...@gmail.com

http://cassandra.link



On Mon, Mar 25, 2019 at 7:31 PM Carl Mueller
 wrote:

> Either:
>
> double-write at the driver level from one of the apps and perform an
> initial and a subsequent sstable loads (or whatever ETL method you want to
> use) to merge the data with good assurances.
>
> use a trigger to replicate the writes, with some sstable loads / ETL.
>
> use change data capture with some sstable loads/ETL
>
> On Mon, Mar 25, 2019 at 5:48 PM Nick Hatfield 
> wrote:
>
>> Maybe others will have a different or better solution but, in my
>> experience to accomplish HA we simply Y-write from our application to the
>> new cluster. You then export the data from the old cluster, using cql2json
>> or any method you choose, to the new cluster. That will cover all live (now)
>> data via the Y-write, while supplying the old data from the copy you run. Once
>> complete, set up a single reader that reads data from the new cluster and
>> verify all is as expected!
>>
>>
>> Sent from my BlackBerry 10 smartphone on the Verizon Wireless 4G LTE network.
>> *From: *Nandakishore Tokala
>> *Sent: *Monday, March 25, 2019 18:39
>> *To: *user@cassandra.apache.org
>> *Reply To: *user@cassandra.apache.org
>> *Subject: *Merging two cluster's in to one without any downtime
>>
>> Please let me know the best practices to combine 2 different cluster's
>> into one without having any downtime.
>>
>> Thanks & Regards,
>> Nanda Kishore
>>
>


Re: good monitoring tool for cassandra

2019-03-14 Thread Rahul Singh
I wrote this last year. It's mostly still relevant --- as Jonathan said,
Prometheus+Grafana is the best "make your own hammers and nails" approach.

https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/



On Thu, Mar 14, 2019 at 8:13 PM Jonathan Haddad  wrote:

> I've worked with several teams using DataDog, folks are pretty happy with
> it.  We (The Last Pickle) did the dashboards for them:
> http://thelastpickle.com/blog/2017/12/05/datadog-tlp-dashboards.html
>
> Prometheus + Grafana is great if you want to host it yourself.
>
> On Fri, Mar 15, 2019 at 12:45 PM Jeff Jirsa  wrote:
>
>>
>> -dev, +user
>>
>> Datadog worked pretty well last time I used it.
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> > On Mar 14, 2019, at 11:38 PM, Sundaramoorthy, Natarajan <
>> natarajan_sundaramoor...@optum.com> wrote:
>> >
>> > Can someone share knowledge on good monitoring tool for cassandra?
>> Thanks
>> >
>> > This e-mail, including attachments, may include confidential and/or
>> > proprietary information, and may be used only by the person or entity
>> > to which it is addressed. If the reader of this e-mail is not the
>> intended
>> > recipient or his or her authorized agent, the reader is hereby notified
>> > that any dissemination, distribution or copying of this e-mail is
>> > prohibited. If you have received this e-mail in error, please notify the
>> > sender by replying to this message and delete this e-mail immediately.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


-- 
rahul.xavier.si...@gmail.com

http://cassandra.link


Re: Adding New Column with Default Value

2019-03-14 Thread Rahul Singh
Spark. Alter the table, add a column, then run a Spark job to scan your
table and set a value:

```
import com.datastax.spark.connector._

val myKeyspace = "pinch"
val myTable = "hitter"

// Merge the new column's value into every row read from the table.
def updateColumns(row: CassandraRow): CassandraRow = {
  val inputMap = row.toMap
  val newData = Map("newColumn" -> "somevalue")
  CassandraRow.fromMap(inputMap ++ newData)
}

sc.cassandraTable(myKeyspace, myTable)
  .map(updateColumns)
  .saveToCassandra(myKeyspace, myTable)
```

Miraculously, the same code can be used to move / copy data from one table
to another, with one modification: save to a different table than the one
you read from.


On Thu, Mar 14, 2019 at 12:57 AM kumar bharath 
wrote:

> Hi ,
>
> Can anyone suggest  a best possible way, how we can add a new column to
> the existing table with default value ?
>
> *Column family Size :*  60 Million single partition records.
>
> Thanks,
> Bharath Kumar B
>


Re: update manually rows in cassandra

2019-03-14 Thread Rahul Singh
CQL supports JSON in and out of a Cassandra table, but if the JSON in your
table is stored as a string, then you need to update it as a string.

https://docs.datastax.com/en/cql/3.3/cql/cql_using/useInsertJSON.html
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useQueryJSON.html

What's the schema of the table?
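
For illustration, both cases side by side (a sketch; the two tables and the
keyspace here are hypothetical):

```
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class JsonUpdates
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace"))
        {
            // INSERT ... JSON maps JSON fields onto real columns of the table.
            session.execute(
                "INSERT INTO users JSON '{\"user_id\": \"u1\", \"name\": \"fixed\"}'");

            // If the JSON lives in a single text column, it is just a string:
            // read it, correct it in application code, write the string back.
            session.execute(
                "UPDATE docs SET payload = ? WHERE doc_id = ?",
                "{\"name\": \"fixed\"}", "d1");
        }
    }
}
```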


On Wed, Mar 13, 2019 at 5:09 PM Sundaramoorthy, Natarajan <
natarajan_sundaramoor...@optum.com> wrote:

> Update json data with the correct value from file. Thanks
>
>
>
>
>
> *Natarajan Sundaramoorthy*
>
> PaaS Engineering and Automation
>
> Desk - 763-744-1854
>
> Email - *natarajan_sundaramoor...@optum.com
> *
>
>
>
>
>
>
>
>
>
>
> *From:* Dieudonné Madishon NGAYA [mailto:dmng...@gmail.com]
> *Sent:* Wednesday, March 13, 2019 3:44 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: update manually rows in cassandra
>
>
>
> Hi ,
>
> In your case, do you want to insert json file data into cassandra ?
>
>
>
> Best regards
>
> _
>
>
> 
>
> *Dieudonne Madishon NGAYA*
> Datastax, Cassandra Architect
> *P: *7048580065
> *w: *www.dmnbigdata.com
> *E: *dmng...@dmnbigdata.com
> *Private E: *dmng...@gmail.com
> *A: *Charlotte,NC,28273, USA
>
>
>
>
>
>
>
> On Wed, Mar 13, 2019 at 4:04 PM Sundaramoorthy, Natarajan <
> natarajan_sundaramoor...@optum.com> wrote:
>
>
>
> *Something got goofed up in database. Data is in json format. We have data
> in some file have to match the data in file and update the database. Can
> you please tell how to do it? New to cassandra. Thanks *
>
>
>
>
> This e-mail, including attachments, may include confidential and/or
> proprietary information, and may be used only by the person or entity
> to which it is addressed. If the reader of this e-mail is not the intended
> recipient or his or her authorized agent, the reader is hereby notified
> that any dissemination, distribution or copying of this e-mail is
> prohibited. If you have received this e-mail in error, please notify the
> sender by replying to this message and delete this e-mail immediately.
>
>
>


Re: Inconsistent results after restore with Cassandra 3.11.1

2019-03-14 Thread Rahul Singh
Can you define "inconsistent" results.. ? What's the topology of the
cluster? What were you expecting and what did you get?

On Thu, Mar 14, 2019 at 7:09 AM sandeep nethi 
wrote:

> Hello,
>
> Does anyone experience inconsistent results after restoring Cassandra
> 3.11.1 with refresh command? Was there any bug in this version of
> cassandra??
>
> Thanks in advance.
>
> Regards,
> Sandeep
>


Re: [EXTERNAL] Re: Migrate large volume of data from one table to another table within the same cluster when COPY is not an option.

2019-03-14 Thread Rahul Singh
Adding to Stefan's comment. There is a "scylladb" migrator, which uses the
Spark connector from DataStax, and theoretically can work on any
Cassandra-compliant DB.. and should not be limited to cassandra-to-scylla.

https://www.scylladb.com/2019/02/07/moving-from-cassandra-to-scylla-via-apache-spark-scylla-migrator/

https://github.com/scylladb/scylla-migrator

On Thu, Mar 14, 2019 at 3:04 PM Durity, Sean R 
wrote:

> The possibility of a highly available way to do this gives more
> challenges. I would be weighing the cost of a complex solution vs the
> possibility of a maintenance window when you stop your app to move the
> data, then restart.
>
>
>
> For the straight copy of the data, I am currently enamored with DataStax’s
> dsbulk utility for unloading and loading larger amounts of data. I don’t
> have extensive experience, yet, but it has been fast enough in my
> experiments – and that is without doing too much tuning for speed. From a
> host not in the cluster, I was able to extract 3.5 million rows in about 11
> seconds. I inserted them into a differently partitioned table in about 26
> seconds. Very small data rows, but it was impressive for not doing much to
> try and speed it up further. (In some other tests, it was about ¼ the time
> of simple copy statement from cqlsh)
>
>
>
> If I was designing something for a “can’t take an outage” scenario, I
> would start with:
>
> -  Writing the data to the old and new tables on all inserts
>
> -  On reads, read from the new table first. If not there, read
>> from the old table <-- could introduce some latency, but would be
> available; could also do asynchronous reads on both tables and choose the
> latest
>
> -  Do this until the data has been copied from old to new (with
> dsbulk or custom code or Spark)
>
> -  Drop the double writes and conditional reads
>
>
>
>
>
> Sean
>
>
>
> *From:* Stefan Miklosovic 
> *Sent:* Wednesday, March 13, 2019 6:39 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
>
>
> Hi Leena,
>
>
>
> as already suggested in my previous email, you could use Apache Spark and
> Cassandra Spark connector (1). I have checked TTLs and I believe you should
> especially read this section (2) about TTLs. Seems like thats what you need
> to do, ttls per row. The workflow would be that you read from your source
> table, making transformations per row (via some mapping) and then you would
> save it to new table.
>
>
>
> This would import it "all" but until you switch to the new table and
> records are still being saved into the original one, I am not sure how to
> cover "the gap" in such sense that once you make the switch, you would miss
> records which were created in the first table after you did the loading.
> You could maybe leverage Spark streaming (Cassandra connector knows that
> too) so you would make this transformation on the fly with new ones.
>
>
>
> (1) https://github.com/datastax/spark-cassandra-connector
> 
>
> (2)
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md#using-a-different-value-for-each-row
> 
>
>
>
>
>
> On Thu, 14 Mar 2019 at 00:13, Leena Ghatpande 
> wrote:
>
> Understand, 2nd table would be a better approach. So what would be the
> best way to copy 70M rows from current table to the 2nd table with ttl set
> on each record as the first table?
>
>
> --
>
> *From:* Durity, Sean R 
> *Sent:* Wednesday, March 13, 2019 8:17 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [EXTERNAL] Re: Migrate large volume of data from one table
> to another table within the same cluster when COPY is not an option.
>
>
>
> Correct, there is no current flag. I think there SHOULD be one.
>
>
>
>
>
> *From:* Dieudonné Madishon NGAYA 
> *Sent:* Tuesday, March 12, 2019 7:17 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Migrate large volume of data from one table to
> another table within the same cluster when COPY is not an option.
>
>
>
> Hi Sean, you can’t flag in Cassandra.yaml not allowing allow filtering ,
> the only thing you can do will be from your data model .
>
> Don’t ask Cassandra to query all data from table but the ideal query will
> be using single partition.
>
>
>
> On Tue, Mar 

Re: Audit in C*

2019-03-13 Thread Rahul Singh
Which version are you referring to?

On Wed, Mar 13, 2019 at 10:28 AM Nitan Kainth  wrote:

> Hi,
>
> Has anybody used auditing to find out failed login attempts, or
> unauthorized access tries?
>
> I found ecAudit by Ericsson, is it free to use? Has anybody tried it?
>
> Ref: https://github.com/Ericsson/ecaudit
>


Re: AxonOps - Cassandra operational management tool

2019-03-12 Thread Rahul Singh
Nice.. Good to see the community producing tools around the Cassandra
product.

Few pieces of feedback

*Kudos*
1. Glad that you are doing it
2. Looks great
3. Willing to try it out if you find this guy called "Free Time" for me :)

*Criticism*
1. It mimics a lot of stack components that are out there.. Though I agree
with you that the prometheus/grafana/etc stack is difficult to get running,
I look at
https://docs.scylladb.com/operating-scylla/monitoring/monitoring_stack/ and
give them kudos for just making a simple tool to leverage what's there.
Even DSE is now drinking the prometheus coolaid
https://www.datastax.com/2018/12/improved-performance-diagnostics-with-datastax-metrics-collector

2. Given a choice of making something on my own (1), using a "stack"
approach similar to Scylla (2), buying something that DSE produces (3), or
buying AxonOps (4), the challenge for a practitioner will be whether the cost
offsets the effective pains of options 1 (more time), 2 (less time), 3 (money).


"It is not the critic who counts; not the man who points out how the
strong man stumbles, or where the doer of deeds could have done them
better. ... It is the man in the arena" - Teddy Roosevelt

Keep playing in the Arena, and looking forward to updates!





On Wed, Mar 6, 2019 at 10:15 AM AxonOps  wrote:

> Hi Kenneth,
>
> We are using AxonOps on a number of production clusters already, but
> we're continuously improving it, so we've got a good level of comfort and
> confidence with the product with our own customers.
>
> In terms of our recommendations on the upper bounds of the cluster size,
> we do not know yet. The biggest resource user is with Elasticsearch that
> stores all the data. The free version available supports up to 6 nodes and
> AxonOps can easily support this.
>
> You can already install the product from our APT or YUM repos. The
> installation instructions are available here - https://docs.axonops.com
>
> Hayato
>
>
> On Tue, 5 Mar 2019 at 20:44, Kenneth Brotman 
> wrote:
>
>> Hayato,
>>
>>
>>
>> I agree with what you are addressing as I’ve always thought the big
>> elephant in the room regarding Cassandra was that you had to use all these
>> other tools, each of which requires updating, configuring changes, and that
>> too much attention had to be paid to all those other tools instead of what
>> your trying to accomplish; when instead if addressed it all could be
>> centralized, internalized, or something but clearly it was quite doable.
>>
>>
>>
>> Questions regarding where things are at:
>>
>>
>>
>> Are you using AxonOps in any of your clients Apache Cassandra production
>> clusters?
>>
>>
>>
>> What is the largest Cassandra cluster in which you use it?
>>
>>
>>
>> Would you recommend NOT using AxonOps on production clusters for now or
>> do you consider it safe to do so?
>>
>>
>>
>> What is the largest Cassandra cluster you would recommend using AxonOps
>> on?
>>
>>
>>
>> Can it handle multi-cloud clusters?
>>
>>
>>
>> Which clouds does it play nice with?
>>
>>
>>
>> Is it good for use for on-prem nodes (or cloud only)?
>>
>>
>>
>> Which versions of Cassandra does it play nice with?
>>
>>
>>
>> Any rough idea when a download will be available?
>>
>>
>>
>> Your blog post at
>> https://digitalis.io/blog/apache-cassandra-management-tool/ provides a
>> lot of answers already!  Really very promising!
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Kenneth Brotman
>>
>>
>>
>>
>>
>>
>>
>> *From:* AxonOps [mailto:axon...@digitalis.io]
>> *Sent:* Sunday, March 03, 2019 7:51 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: AxonOps - Cassandra operational management tool
>>
>>
>>
>> Hi Kenneth,
>>
>>
>>
>> Thanks for your great feedback! We're not trying to be secretive, but
>> just not amazing at promoting ourselves!
>>
>>
>>
>> AxonOps was built by digitalis.io (https://digitalis.io), a company
>> based in the UK providing consulting and managed services for Cassandra,
>> Kafka and Spark. digitalis.io was founded 3 years ago by 2 ex-DataStax
>> architects but their experience of Cassandra predates the tenure at
>> DataStax.
>>
>>
>>
>> We have been looking after a lot of Cassandra clusters for our customers,
>> but found ourselves spending more time maintaining monitoring and
>> operational tools than Cassandra clusters themselves. The motivation was to
>> build a management platform to make our lives easier. You can read my blog
>> here - https://digitalis.io/blog/apache-cassandra-management-tool/
>>
>>
>>
>> We have not yet created any videos but that's in our backlog so people
>> can see AxonOps in action. No testimonials yet either since the customer of
>> the product has been ourselves, and only just released it to the public as
>> beta few weeks ago. We've decided to share it for free to anybody using up
>> to 6 nodes, as we see a lot of clusters out there within this range.
>>
>>
>>
>> The only investment would be a minimum amount of your time to install it.
>> We have made the installation process as easy as 

Re: cassandra upgrades multi-DC in parallel

2019-03-12 Thread Rahul Singh
Carl,

If you have automated it and tested it a few times in a lower
environment with the same data as production, I'd say go for it.. but as
Jonathan said, if there's an issue, you won't be able to continue
operations.



On Tue, Mar 12, 2019 at 3:20 PM Jonathan Haddad  wrote:

> Nothing prevents it technically, but operationally you might not want to.
> Personally I’d prefer have the safety net of a dc to fall back on in case
> there’s an issue with the upgrade.
>
> On Wed, Mar 13, 2019 at 7:48 AM Carl Mueller
>  wrote:
>
>> If there are multiple DCs in a cluster, is it safe to upgrade them in
>> parallel, with each DC doing a node-at-a-time?
>>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>


Re: [EXTERNAL] RE: SASI queries- cqlsh vs java driver

2019-02-27 Thread Rahul Singh
+1 on DataStax, and you could consider looking at Elassandra.

On Thu, Feb 7, 2019 at 9:14 AM Durity, Sean R 
wrote:

> Kenneth is right. Trying to port/support a relational model to a CQL model
> the way you are doing it is not going to go well. You won’t be able to
> scale or get the search flexibility that you want. It will make Cassandra
> seem like a bad fit. You want to play to Cassandra’s strengths –
> availability, low latency, scalability, etc. so you need to store the data
> the way you want to retrieve it (query first modeling!). You could look at
> defining the “right” partition and clustering keys, so that the searches
> are within a single, reasonably sized partition. And you could have lookup
> tables for other common search patterns (item_by_model_name, etc.)
>
>
>
> If that kind of modeling gets you to a situation where you have too many
> lookup tables to keep consistent, you could consider something like
> DataStax Enterprise Search (embedded SOLR) to create SOLR indexes on
> searchable fields. A SOLR query will typically be an order of magnitude
> slower than a partition key lookup, though.
>
>
>
> It really boils down to the purpose of the data store. If you are looking
> for primarily an “anything goes” search engine, Cassandra may not be a good
> choice. If you need Cassandra-level availability, extremely low latency
> queries (on known access patterns), high volume/low latency writes, easy
> scalability, etc. then you are going to have to rethink how you model the
> data.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Kenneth Brotman 
> *Sent:* Thursday, February 07, 2019 7:01 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] RE: SASI queries- cqlsh vs java driver
>
>
>
> Peter,
>
>
>
> Sounds like you may need to use a different architecture.  Perhaps you
> need something like Presto or Kafka as a part of the solution.  If the data
> from the legacy system is wrong for Cassandra it’s an ETL problem?  You’d
> have to transform the data you want to use with Cassandra so that a proper
> data model for Cassandra can be used.
>
>
>
> *From:* Peter Heitman [mailto:pe...@heitman.us ]
> *Sent:* Wednesday, February 06, 2019 10:05 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: SASI queries- cqlsh vs java driver
>
>
>
> Yes, I have read the material. The problem is that the application has a
> query facility available to the user where they can type in "(A = foo AND B
> = bar) OR C = chex" where A, B, and C are from a defined list of terms,
> many of which are columns in the mytable below while others are from other
> tables. This query facility was implemented and shipped years before we
> decided to move to Cassandra
>
> On Thu, Feb 7, 2019, 8:21 AM Kenneth Brotman 
> wrote:
>
> The problem is you’re not using a query first design.  I would recommend
> first reading chapter 5 of Cassandra: The Definitive Guide by Jeff
> Carpenter and Eben Hewitt.  It’s available free online at this link
> 
> .
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Peter Heitman [mailto:pe...@heitman.us]
> *Sent:* Wednesday, February 06, 2019 6:33 PM
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: SASI queries- cqlsh vs java driver
>
>
>
> Yes, I "know" that allow filtering is a sign of a (possibly fatal)
> inefficient data model. I haven't figured out how to do it correctly yet
>
> On Thu, Feb 7, 2019, 7:59 AM Kenneth Brotman 
> wrote:
>
> Exactly.  When you design your data model correctly you shouldn’t have to
> use ALLOW FILTERING in the queries.  That is not recommended.
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Peter Heitman [mailto:pe...@heitman.us]
> *Sent:* Wednesday, February 06, 2019 6:09 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: SASI queries- cqlsh vs java driver
>
>
>
> You are completely right! My problem is that I am trying to port code for
> SQL to CQL for an application that provides the user with a relatively
> general search facility. The original implementation didn't worry about
> secondary indexes - it just took advantage of the ability to create
> arbitrarily complex queries with inner joins, left joins, etc. I am
> reimplimenting it to create a parse tree of CQL queries and doing the ANDs
> and ORs in the application. Of course once I get enough of this implemented
> I will have to load up the table with a large data set and see if it gives
> acceptable performance for our use case.
>
> On Wed, Feb 6, 2019, 8:52 PM Kenneth Brotman 
> 

Re: High GC pauses leading to client seeing impact

2019-02-27 Thread Rahul Singh
There are a few factors: sometimes data in a fat partition clogs up
the heap space / memtable space, and tombstones don't help much either.
This is worsened by data skew. I agree: if CMS is working for now,
continue using it and then upgrade to better versions of Java / C*.

Few things you can do to see what's going on

1. Use Hubspot's gc visualizer https://github.com/HubSpot/gc_log_visualizer
2. Look at the heapdump in a  JVM explorer to see what is taking up memory.
https://docs.oracle.com/javase/8/docs/technotes/guides/visualvm/heapdump.html

GC visualizer shows you the patterns of GC... and the VisualVM if you can
connect to an existing JVM instance will show you what is coming and going.
Looking at a heapdump also helps you see what type of objects are in memory
-- rows, cells, which cell types





On Mon, Feb 11, 2019 at 3:06 AM Elliott Sims  wrote:

> I would strongly suggest you consider an upgrade to 3.11.x.  I found it
> decreased space needed by about 30% in addition to significantly lowering
> GC.
>
> As a first step, though, why not just revert to CMS for now if that was
> working ok for you?  Then you can convert one host for diagnosis/tuning so
> the cluster as a whole stays functional.
>
> That's also a pretty old version of the JDK to be using G1.  I would
> definitely upgrade that to 1.8u202 and see if the problem goes away.
>
> On Sun, Feb 10, 2019, 10:22 PM Rajsekhar Mallick  wrote:
>
>> Hello Team,
>>
>> I have a cluster of 17 nodes in production.(8 and 9 nodes in 2 DC).
>> Cassandra version: 2.0.11
>> Client connecting using thrift over port 9160
>> Jdk version : 1.8.066
>> GC used : G1GC (16GB heap)
>> Other GC settings:
>> Maxgcpausemillis=200
>> Parallels gc threads=32
>> Concurrent gc threads= 10
>> Initiatingheapoccupancypercent=50
>> Number of cpu cores for each system : 40
>> Memory size: 185 GB
>> Read/sec : 300 /sec on each node
>> Writes/sec : 300/sec on each node
>> Compaction strategy used : Size tiered compaction strategy
>>
>> Identified issues in the cluster:
>> 1. Disk space usage across all nodes in the cluster is 80%. We are
>> currently working on adding more storage on each node
>> 2. There are 2 tables for which we keep on seeing large number of
>> tombstones. One of table has read requests seeing 120 tombstones cells in
>> last 5 mins as compared to 4 live cells. Tombstone warns and Error messages
>> of query getting aborted is also seen.
>>
>> Current issue sen:
>> 1. We keep on seeing GC pauses of few minutes randomly across nodes in
>> the cluster. GC pauses of 120 seconds, even 770 seconds are also seen.
>> 2. This leads to nodes getting stalled and client seeing direct impact
>> 3. The GC pause we see, are not during any of G1GC phases. The GC log
>> message prints “Time to stop threads took 770 seconds”. So it is not the
>> garbage collector doing any work but stopping the threads at a safe point
>> is taking so much of time.
>> 4. This issue has surfaced recently after we changed 8GB(CMS) to
>> 16GB(G1GC) across all nodes in the cluster.
>>
>> Kindly do help on the above issue. I am not able to exactly understand if
>> the GC is wrongly tuned, other if this is something else.
>>
>> Thanks,
>> Rajsekhar Mallick
>>
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>


Re: Connection status on cluster exposed anywhere?

2019-02-27 Thread Rahul Singh
You can get statistics at the table level, and you can get some information
at the keyspace level, but it's an approximation. Better to get table-level
stats and aggregate up. Here are some pointers:

https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/



On Wed, Feb 27, 2019 at 4:36 AM Tom Wollert 
wrote:

> Is it possible to get the current connection status out of Cassandra C#
> driver?
>
> In particular I'm looking at getting
> - connected hosts
> - whether the node is considered up by the driver
> Which I can get from the MetaData, as well as:
> - connected keyspaces
> - the number of connections open(per keyspace, as the connectionpool is
> not shared)
> - the number of current read/writes (per keyspace/host ideally)
> Which seems to be internal state that is never exposed.
>
> Any ideas before I use reflection to look at the internal state?
>
> Cheers,
>
> Tom
>
> 
>
> *Phone:* 0800 021 0888  * Email: *contac...@codeweavers.net
> *Codeweavers Ltd* | Barn 4 | Dunston Business Village | Dunston | ST18 9AB
> Registered in England and Wales No. 04092394 | VAT registration no. 974
> 9705 63
>
> 
>


Re: Feedback wanted for Knowledge base for all things cassandra (cassandra.link)

2019-02-25 Thread Rahul Singh
Kenneth,

Thanks for the support! I totally agree, and that's exactly how I think of
it.. as a Body of Knowledge. Each of those iterations is to show how
people have asked for the knowledge to be presented so far.

Awesome is a proven format for many people for specific lists of useful
links on a subject. The second iteration was more for folks that want all
the "best videos" or "slides" etc.. and the third is to be able to filter
it. There's another use case of specifically human curating / ordering of
existing links to a specific TOC subject or role as you describe it, or to
a Cassandra doc command for example. Your structure makes a lot of sense.
For the time being, the backend structure I'm using only has tags for
taxonomy.. In addition to attributes like you mentioned, I am also
considering a richer "book"- or "tree"-like organization via an internal
directed graph to provide the type of structure that books and series of
articles provide.

Google's value is in the discovery of the knowledge, but they don't
necessarily have the human-level intel to rank it based on what a beginner,
intermediate, or advanced person may see as the same knowledge. That
level of knowledge ranking, which I do work on, is a specialization of
personalization but requires tons of feedback (the kind that Google gets
with searches + clicks + more). For the time being, the goal is simple: how
to get people from zero to hero .. or 0.5 to hero in the shortest time
possible, and also still be relevant for heroes to come back and keep tabs
on what's happening.

What I don't want is for it to be a "news" hub primarily because that's
already done with the alteroot.org site and unless they take it down, I'll
probably just take their feed and integrate it and add others as I find
them.

I wish I had more time to contribute!

Best,


On Mon, Feb 25, 2019 at 11:17 AM Kenneth Brotman
 wrote:

> Hi Rahul!
>
>
>
> A truly outstanding effort for the community.  I think it should be tied
> to a complete, what’s known as a “Body of Knowledge”, which is the best way
> to organize it as a learning resource.  Otherwise, you are just trying to
> out-Google Google.  Lots of luck.  They will always have superior search
> and A.I.  You’d be just looking up subject matter sources on Google to try
> to keep your data current and complete.
>
>
>
> But, if you seek to describe the competencies of various roles related to
> Cassandra – role based is best structure and describing competencies is the
> next step flowing from the body of knowledge - then you have a specialized
> resource for all learners of Cassandra at all levels of competence and for
> all roles related to Cassandra.
>
>
>
> Regarding the resource directories, for several reasons all resource
> directories should show:
>
>   Type of medium: i.e. video, book, article, course, etc.
>
>   Date of production
>
>   Author/Presenter
>
>   Publisher/Producer/Event
>
>
>
> Thank you for the continuing effort you have made on this project Rahul!
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Rahul Singh [mailto:rahul.xavier.si...@gmail.com]
> *Sent:* Monday, February 25, 2019 7:05 AM
> *To:* user
> *Subject:* Feedback wanted for Knowledge base for all things cassandra
> (cassandra.link)
>
>
>
> Folks,
>
>
> I've been scrounging time to work on a knowledge resource for all things
> Cassandra ( Cassandra, DSE, Scylla, YugaByte, Elassandra)
>
> I feel like the Cassandra core community still has the most knowledge even
> though people are fragmenting into their brands.
>
>
>
> Would love to get your feedback on what you guys would want as a go to
> resource for Cassandra development, administration, architecture, etc.
> resources.
>
>
> MVP 1

> https://anant.github.io/awesome-cassandra

> MVP 2

> https://cassandra.netlify.com/

> MVP 3

> https://leaves-search.netlify.com/documents.html#/q=*:*=tags:(cassandra)=*=20&=
>
>
>
> Each of these were iterated with feedback from the community, so would
> love to get your feedback to make it better.
>
>
>
> Up next is to add the RSS feeds from the major Cassandra folks like on
> https://cassandra.alteroot.org
>
>
>
> Thanks for your feedback in advance.
>
>
>


Feedback wanted for Knowledge base for all things cassandra (cassandra.link)

2019-02-25 Thread Rahul Singh
Folks,

I've been scrounging time to work on a knowledge resource for all things
Cassandra ( Cassandra, DSE, Scylla, YugaByte, Elassandra)

I feel like the Cassandra core community still has the most knowledge even
though people are fragmenting into their brands.

Would love to get your feedback on what you guys would want as a go to
resource for Cassandra development, administration, architecture, etc.
resources.

MVP 1
https://anant.github.io/awesome-cassandra

MVP 2
https://cassandra.netlify.com/

MVP 3
https://leaves-search.netlify.com/documents.html#/q=*:*=tags:(cassandra)=*=20&=

Each of these were iterated with feedback from the community, so would love
to get your feedback to make it better.

Up next is to add the RSS feeds from the major Cassandra folks like on
https://cassandra.alteroot.org

Thanks for your feedback in advance.


Re: C* as fluent data storage, 10MB/sec/node?

2018-12-20 Thread Rahul Singh
Agree with Jeff on TWCS. Also look
at https://github.com/paradoxical-io/cassieq for reference. Good ideas for a 
queue on Cassandra.
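
A sketch of the bucketed-table idea from Jeff's advice below (the schema and
names are hypothetical; the 10-minute window and 1-hour TTL follow his
numbers):

```
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class TimeBucketedEvents
{
    public static void main(String[] args)
    {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect())
        {
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

            // Partition by a coarse time bucket so partitions stay bounded and
            // TWCS can drop whole SSTables once every row in them has expired.
            session.execute("CREATE TABLE IF NOT EXISTS demo.events ("
                + " bucket timestamp, id timeuuid, payload blob,"
                + " PRIMARY KEY (bucket, id))"
                + " WITH compaction = {'class': 'TimeWindowCompactionStrategy',"
                + "  'compaction_window_unit': 'MINUTES', 'compaction_window_size': '10'}"
                + " AND default_time_to_live = 3600"); // 1 hour

            // Writers compute the bucket as "now" truncated to 10 minutes.
            long tenMinutesMillis = 10 * 60 * 1000L;
            long bucket = (System.currentTimeMillis() / tenMinutesMillis) * tenMinutesMillis;
            session.execute("INSERT INTO demo.events (bucket, id, payload)"
                + " VALUES (?, now(), ?)",
                new java.util.Date(bucket),
                java.nio.ByteBuffer.wrap(new byte[] { 1, 2, 3 }));
        }
    }
}
```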

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Nov 28, 2018, 5:33 PM -0500, Adam Smith , wrote:
> Thanks for the excellent advice, this was extremely helpful! Did not know 
> about TWCS... curing a lot of headache.
>
> Adam
>
> > Am Mi., 28. Nov. 2018 um 20:47 Uhr schrieb Jeff Jirsa :
> > > Probably fine as long as there’s some concept of time in the partition 
> > > key to keep them from growing unbounded.
> > >
> > > Use TWCS, TTLs and something like 5-10 minute buckets. Don’t use RF=1, 
> > > but you can write at CL ONE. TWCS will largely just drop whole sstables 
> > > as they expire (especially with 3.11 and the more aggressive expiration 
> > > logic there)
> > >
> > >
> > >
> > > --
> > > Jeff Jirsa
> > >
> > >
> > > > On Nov 28, 2018, at 11:24 AM, Adam Smith  
> > > > wrote:
> > > >
> > > > Hi All,
> > > >
> > > > I need to use C* somehow as fluent data storage - maybe this is 
> > > > different to the queue antipattern? Lots of data come in 
> > > > (10MB/sec/node), remains for e.g. 1 hour and should then be evicted. It 
> > > > is somehow not critical when data would occasionally disappear/get lost.
> > > >
> > > > Thankful for any advice!
> > > >
> > > > Is this nowadays possible without suffering too much from compaction? 
> > > > I would not have range tombstones, and depending on a possible 
> > > > solution would only use point deletes (PK+CK). There is only one CK, 
> > > > which could also be empty.
> > > >
> > > > 1) The data is usually 1 MB. Can I just update with empty data? PK + CK 
> > > > would remain, but I would not carry about that. Would this create 
> > > > tombstones or is equivalent to a DELETE?
> > > >
> > > > 2) Like 1), and then later set a TTL == only a small amount of data to 
> > > > be deleted then? And hopefully small compaction?
> > > >
> > > > 3) Simply setting TTL 1h and hoping the best, because I am wrong with 
> > > > my worries?
> > > >
> > > > 4) Any optimization strategies like setting the RF to 1? Which 
> > > > compaction strategy is advised?
> > > >
> > > > 5) Are there any recent performance benchmarks for one of the scenarios?
> > > >
> > > > What else could I do?
> > > >
> > > > Thanks a lot!
> > > > Adam
> > >
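
A minimal CQL sketch of the pattern Jeff describes above: a time bucket in the
partition key, TWCS with small windows, and a table-level TTL. The keyspace,
table, and column names here are illustrative assumptions, not from the thread.

```
-- hypothetical table for ~1-hour-lived ingest data, bucketed by 10-minute window
CREATE TABLE ingest.events_by_bucket (
    bucket    timestamp,   -- start of the 10-minute window, computed client-side
    source_id text,
    ts        timeuuid,
    payload   blob,
    PRIMARY KEY ((bucket, source_id), ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'MINUTES',
                     'compaction_window_size': '10'}
  AND default_time_to_live = 3600;   -- 1 hour, per the use case
```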


Re: Optimizing for connections

2018-12-20 Thread Rahul Singh
See inline

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Dec 9, 2018, 2:02 PM -0500, Devaki, Srinivas , wrote:
> Hi Guys,
>
> Have a couple of questions regarding the connections to cassandra,
>
> 1. What are the recommended number of connections per cassandra node?

Depends on hardware.

> 2. Is it a good idea to create coordinator nodes (with `num_tokens: 0`) and 
> whitelist only those hosts from the client side, so that the main worker 
> nodes don't need to work on connection threads?

Defeats the purpose of having a masterless system.

> 3. does the request time on client side include connect time?

Who is measuring?


> 4. Is there any hard limit on number of connections that can be set on 
> cassandra?
>

Read : 
https://stackoverflow.com/questions/33562374/cassandra-throttling-workload

> Thanks a lot for your help
>


Re: Alter table

2018-12-20 Thread Rahul Singh
If you use collections such as a map you could get by with just upserts. A 
collection in a column gives you the ability to have a “flexible” schema for 
your “documents”, as in Mongo, while the regular fields can act as “records”, 
as in a more traditional table.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Dec 17, 2018, 4:45 PM -0500, Mark Furlong , wrote:
> Why would I want to use alter table vs upserts with the new document format?
>
> Mark Furlong
> Sr. Database Administrator
> mfurl...@ancestry.com
> M: 801-859-7427
> O: 801-705-7115
> 1300 W Traverse Pkwy
> Lehi, UT 84043
>
>
>
>
>
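
A small CQL sketch of the map-based approach (all names are hypothetical): the
regular columns behave like record fields, while the map absorbs
document-style attributes through plain upserts, with no ALTER TABLE needed
per new attribute.

```
CREATE TABLE docs.documents (
    doc_id     uuid PRIMARY KEY,
    title      text,               -- fixed, "record-like" field
    attributes map<text, text>     -- flexible, "document-like" fields
);

-- the first write creates the row; later upserts add or overwrite map entries
INSERT INTO docs.documents (doc_id, title, attributes)
VALUES (123e4567-e89b-12d3-a456-426614174000, 'spec', {'format': 'pdf'});

UPDATE docs.documents SET attributes['pages'] = '12'
WHERE doc_id = 123e4567-e89b-12d3-a456-426614174000;
```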


Re: Cassandra repair in different version

2018-09-21 Thread Rahul Singh
Is there a reason why these versions are so different? I would recommend 
bringing the 3.0.6 nodes to 3.0.13 before running cluster-wide commands.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 19, 2018, 2:02 PM +0200, nokia ceph , wrote:
> Hi ,
>
> i have 5 node cassandra cluster. 3 nodes are in 3.0.13 version and 2 nodes 
> are in 3.0.6.
> Is it safe to do a node repair on the cluster??
>
>  Regards,
> Renoy
>


Re: Cassandra system table diagram

2018-09-21 Thread Rahul Singh
I think his question was related specifically to the system tables. KDM is a 
good tool for designing the tables but not necessarily for viewing the system 
tables.

Abdul, try out a tool called DB Schema Visualizer. It supports Cassandra.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 19, 2018, 11:49 PM +0200, Joseph Arriola , wrote:
> Hi, I recommend this:
>
> http://kdm.dataview.org/
>
> It's free and it has the implementation of good design practices that you can 
> find in the DataStax courses.
>
>
>
>
> > 2018-09-19 11:26 GMT-06:00 Abdul Patel :
> > > Hi,
> > >
> > > Do we have somehwere cassandra system tables relation diagram?
> > > Or just system table diagram?
>


Re: Scrub a single SSTable only?

2018-09-11 Thread Rahul Singh
What’s the RF for that data? If you can manage downtime on one node, I’d 
recommend just bringing it down, and then repairing after you delete the bad 
file and bring it back up.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 11, 2018, 2:55 AM -0400, Steinmaurer, Thomas 
, wrote:
> Hello,
>
> is there a way to Online scrub a particular SSTable file only and not the 
> entire column family?
>
> According to the Cassandra logs we have a corrupted SSTable smallish compared 
> to the entire data volume of the column family in question.
>
> To my understanding, both, nodetool scrub and sstablescrub operate on the 
> entire column family and can’t work on a single SSTable, right?
>
> There is still the way to shutdown Cassandra and remove the file from disk, 
> but ideally I want to have that as an online operation.
>
> Perhaps there is something JMX based?
>
> Thanks,
> Thomas
>


Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-11 Thread Rahul Singh
You know what they say: Go big or go home.

Right now candidates are Cassandra itself (embedded or on the side, not on the 
actual data clusters), ZooKeeper (yuck), Kafka (which needs ZooKeeper, yuck), 
and S3 (outside service dependency, so no go).

Jeff, those are great patterns, especially the second one. Have used it 
several times. Cassandra is a great place to store data in transport.


Rahul
On Sep 10, 2018, 5:21 PM -0400, DuyHai Doan , wrote:
> Also using Calvin means having to implement a distributed monotonic sequence 
> as a primitive, not trivial at all ...
>
> > On Mon, Sep 10, 2018 at 3:08 PM, Rahul Singh  
> > wrote:
> > > In response to mimicking Advanced Replication in DSE: I understand the 
> > > goal. Although DSE Advanced Replication does one-way replication, those 
> > > are use cases with limited value to me because ultimately it’s still a 
> > > master-slave design.
> > >
> > > I’m working on a prototype for this for two way replication between 
> > > clusters or databases regardless of dB tech - and every variation I can 
> > > get to comes down to some implementation of the Calvin protocol which 
> > > basically verifies the change in either cluster , sequences it according 
> > > to impact to underlying data, and then schedules the mutation in a 
> > > predictable manner on both clusters / DBS.
> > >
> > > All that means is that I need to sequence the change before it happens so 
> > > I can predictably ensure it’s Scheduled for write / Mutation. So I’m
> > > Back to square one: having a definitive queue / ledger separate from the 
> > > individual commit log of the cluster.
> > >
> > >
> > > Rahul Singh
> > > Chief Executive Officer
> > > m 202.905.2818
> > >
> > > Anant Corporation
> > > 1010 Wisconsin Ave NW, Suite 250
> > > Washington, D.C. 20007
> > >
> > > We build and manage digital business technology platforms.
> > > On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi 
> > > , wrote:
> > > > > On Sep 9, 2018, at 6:08 AM, Jonathan Haddad  
> > > > > wrote:
> > > > >
> > > > > There may be some use cases for it.. but I'm not sure what they are.  
> > > > > It might help if you shared the use cases where the extra complexity 
> > > > > is required?  When does writing to Cassandra which then dedupes and 
> > > > > writes to Kafka a preferred design then using Kafka and simply 
> > > > > writing to Cassandra?
> > > >
> > > > From the reading of the proposal, it seems to bring functionality 
> > > > similar to MySQL's binlog-to-Kafka connector. This is useful for many 
> > > > applications that want to be notified when certain (or any) rows change 
> > > > in the database primarily for a event driven application architecture.
> > > >
> > > > Implementing this in the database layer means there is a standard 
> > > > approach to getting a change notification stream. Downstream 
> > > > subscribers can then decide which notifications to act on.
> > > >
> > > > LinkedIn's databus is similar in functionality - 
> > > > https://github.com/linkedin/databus However it is for heterogenous 
> > > > datastores.
> > > >
> > > > > > On Thu, Sep 6, 2018 at 1:53 PM Joy Gao  
> > > > > > wrote:
> > > > > > >
> > > > > > >
> > > > > > > We have a WIP design doc that goes over this idea in details.
> > > > > > >
> > > > > > > We haven't sort out all the edge cases yet, but would love to get 
> > > > > > > some feedback from the community on the general feasibility of 
> > > > > > > this approach. Any ideas/concerns/questions would be helpful to 
> > > > > > > us. Thanks!
> > > > > > >
> > > >
> > > > Interesting idea. I did go over the proposal briefly. I concur with Jon 
> > > > about adding more use-cases to clarify this feature's potential 
> > > > use-cases.
> > > >
> > > > Dinesh
>


Re: Regarding migrating data from Oracle to Cassandra.migrate data from Oracle to Cassandra.

2018-09-10 Thread Rahul Singh
Look into Kafka Connect. It does tracking internally in a topic. Works better 
going from relational to Cassandra.

Still won’t fix your potential data model issue related to skew and wide 
partitions.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 6, 2018, 9:20 AM -0400, sha p , wrote:
> Thank you Jeff.
> While migrating, how can I test/validate against Cassandra, particularly as 
> I am going for a "parallel run"? Any sample strategy?
>
>
> Regards,
> Shyam
>
> > On Thu, 6 Sep 2018, 09:48 Jeff Jirsa,  wrote:
> > > It very much depends on your application. You'll PROBABLY want to double 
> > > write for some period of time -  start writes to both Cassandra and 
> > > Oracle, and then ensure they're both in sync. Once you're sure they're 
> > > both in sync, move your reads from Oracle to Cassandra.
> > >
> > >
> > >
> > > > On Wed, Sep 5, 2018 at 8:58 PM sha p  wrote:
> > > > > Hi all,
> > > > > Sir how should I keep track of the data which is moved to Cassandra , 
> > > > > what are the best strategies available?
> > > > >
> > > > > Regards,
> > > > > Shyam
> > > > >
> > > > > > On Wed, 5 Sep 2018, 18:51 sha p,  wrote:
> > > > > > > >
> > > > > > > > > Hi all ,
> > > > > > > > > Me new to Cassandra , i was asked to migrate data from Oracle 
> > > > > > > > > to Cassandra.
> > > > > > > > > Please help me giving your valuable guidance.
> > > > > > > > > 1) Can it be done using open source Cassandra.
> > > > > > > > > 2) Where should I start data model from?
> > > > > > > > > 3) I should use java, what kind of  jar/libs/tools I need use 
> > > > > > > > > ?
> > > > > > > > > 4) How I decide the size of cluster , please provide some 
> > > > > > > > > sample guidelines.
> > > > > > > > > 5) this should be in production , so what kind of things i 
> > > > > > > > > should take care for better support or debugging tomorrow?
> > > > > > > > > 6) Please provide some good books /links which can help me in 
> > > > > > > > > this task.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks in advance.
> > > > > > > > > Highly appreciated your every amal help.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Shyam


Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread Rahul Singh
In response to mimicking Advanced Replication in DSE: I understand the goal. 
Although DSE Advanced Replication does one-way replication, those are use 
cases with limited value to me because ultimately it’s still a master-slave 
design.

I’m working on a prototype for this for two way replication between clusters or 
databases regardless of dB tech - and every variation I can get to comes down 
to some implementation of the Calvin protocol which basically verifies the 
change in either cluster , sequences it according to impact to underlying data, 
and then schedules the mutation in a predictable manner on both clusters / DBS.

All that means is that I need to sequence the change before it happens so I can 
predictably ensure it’s scheduled for write/mutation. So I’m back to square 
one: having a definitive queue/ledger separate from the 
individual commit log of the cluster.


Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi , 
wrote:
> > On Sep 9, 2018, at 6:08 AM, Jonathan Haddad  wrote:
> >
> > There may be some use cases for it.. but I'm not sure what they are.  It 
> > might help if you shared the use cases where the extra complexity is 
> > required?  When does writing to Cassandra which then dedupes and writes to 
> > Kafka a preferred design then using Kafka and simply writing to Cassandra?
>
> From the reading of the proposal, it seems to bring functionality similar to 
> MySQL's binlog-to-Kafka connector. This is useful for many applications that 
> want to be notified when certain (or any) rows change in the database 
> primarily for a event driven application architecture.
>
> Implementing this in the database layer means there is a standard approach to 
> getting a change notification stream. Downstream subscribers can then decide 
> which notifications to act on.
>
> LinkedIn's databus is similar in functionality - 
> https://github.com/linkedin/databus However it is for heterogenous datastores.
>
> > > On Thu, Sep 6, 2018 at 1:53 PM Joy Gao  wrote:
> > > >
> > > >
> > > > We have a WIP design doc that goes over this idea in details.
> > > >
> > > > We haven't sort out all the edge cases yet, but would love to get some 
> > > > feedback from the community on the general feasibility of this 
> > > > approach. Any ideas/concerns/questions would be helpful to us. Thanks!
> > > >
>
> Interesting idea. I did go over the proposal briefly. I concur with Jon about 
> adding more use-cases to clarify this feature's potential use-cases.
>
> Dinesh


Re: Using CDC Feature to Stream C* to Kafka (Design Proposal)

2018-09-10 Thread Rahul Singh
Not everyone has it their way like Frank Sinatra. Due to various reasons, folks 
need to get the changes in Cassandra to be duplicated to a topic for further 
processing - especially if the new system owner doesn’t own the whole platform.

There are various ways to do this but you have to deal with the consequences.

1. Kafka Connect using Landoop's current source connector, which does “allow 
filtering” on tables. Sends changes to a Kafka topic. Then you can process 
using Kafka Streams, a Kafka Connect sink, or the Kafka Consumer API.

2. CDC to Kafka, especially if the CDC is coming from commit logs; you may 
see duplicates from nodes (see the CQL sketch after this thread).

3. Triggers to Kafka: this is the only way I know of now to do once-only 
messages to Kafka for every mutation that Cassandra receives. This could be 
problematic because you may fail to send a message to Kafka, and you only get 
it once.

Ideally you’ll want to do what Jon suggested and source the event from Kafka 
for all subsequent processes, rather than process in Cassandra and then create 
the event in Kafka.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 10, 2018, 3:58 AM -0400, Dinesh Joshi , 
wrote:
> > On Sep 9, 2018, at 6:08 AM, Jonathan Haddad  wrote:
> >
> > There may be some use cases for it.. but I'm not sure what they are.  It 
> > might help if you shared the use cases where the extra complexity is 
> > required?  When does writing to Cassandra which then dedupes and writes to 
> > Kafka a preferred design then using Kafka and simply writing to Cassandra?
>
> From the reading of the proposal, it seems to bring functionality similar to 
> MySQL's binlog-to-Kafka connector. This is useful for many applications that 
> want to be notified when certain (or any) rows change in the database 
> primarily for a event driven application architecture.
>
> Implementing this in the database layer means there is a standard approach to 
> getting a change notification stream. Downstream subscribers can then decide 
> which notifications to act on.
>
> LinkedIn's databus is similar in functionality - 
> https://github.com/linkedin/databus However it is for heterogenous datastores.
>
> > > On Thu, Sep 6, 2018 at 1:53 PM Joy Gao  wrote:
> > > >
> > > >
> > > > We have a WIP design doc that goes over this idea in details.
> > > >
> > > > We haven't sort out all the edge cases yet, but would love to get some 
> > > > feedback from the community on the general feasibility of this 
> > > > approach. Any ideas/concerns/questions would be helpful to us. Thanks!
> > > >
>
> Interesting idea. I did go over the proposal briefly. I concur with Jon about 
> adding more use-cases to clarify this feature's potential use-cases.
>
> Dinesh
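
For option 2 in Rahul's list, change data capture is enabled per table in CQL
(Cassandra 3.8+); the table name here is a hypothetical example. Note that
this only retains flushed commit log segments in each replica's cdc_raw
directory; deduplicating them and shipping to Kafka is left to a separate
consumer, which is where the per-node duplicates come from.

```
-- keep commit log segments containing this table's mutations under cdc_raw
-- for an external consumer to read, deduplicate, and publish to Kafka
ALTER TABLE shop.orders WITH cdc = true;
```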


Re: [EXTERNAL] Regarding migrating data from Oracle to Cassandra.migrate data from Oracle to Cassandra.

2018-09-05 Thread Rahul Singh
Look here for some “migration” or data modeling articles.

https://anant.github.io/awesome-cassandra/

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 5, 2018, 10:47 AM -0500, Jeff Jirsa , wrote:
> All of  Sean's points are good, a few more:
> - Apache Cassandra (free, open source, official) is usually sufficient. DSE 
> may be faster, but really it's about whether or not you're willing to pay for 
> support. If you're trying to stop paying Oracle, I suspect you'd probably not 
> want to start paying someone else - try the free version first, and you can 
> look for proprietary options after that.
> - http://shop.oreilly.com/product/0636920043041.do is relatively recent and 
> mostly pretty good
> - Ask a lot of questions, use this list, but try things out first so people 
> have a way to point you in the right direction.
>
>
>
> > On Wed, Sep 5, 2018 at 7:58 AM Durity, Sean R  
> > wrote:
> > > 3 starting points:
> > > -  DO NOT migrate your tables as they are in Oracle to Cassandra. 
> > > In most cases, you need a different model for Cassandra
> > > -  DO take the (free) DataStax Academy courses to learn much more 
> > > about Cassandra as you dive in. It is a systematic and bite-size approach 
> > > to learning all things Cassandra (and eventually, DataStax Enterprise, 
> > > should you go that way). However, open source Cassandra is fine as a data 
> > > platform. DSE gives you more options for data models, better 
> > > administration and monitoring tools, support, etc. It all depends on what 
> > > you need/want to build/can afford
> > > -  Cluster sizing depends on your goals for the data platform. Do 
> > > you need lots of storage, lots of throughput, high availability, low 
> > > latency, workload separation, etc.? A couple guidelines – use at least 3 
> > > nodes per data center (DC) and at least 2 DCs for availability. Use SSDs 
> > > for storage and keep node size 3 TB or less for reasonable 
> > > administration.  If six nodes are too many – you probably don’t need 
> > > Cassandra. If you can define what you need your data platform to deliver, 
> > > then you can start a sizing discussion. The good thing is, you can always 
> > > scale (as long as the data model is good).
> > >
> > >
> > > Sean Durity
> > >
> > > From: sha p 
> > > Sent: Wednesday, September 05, 2018 9:21 AM
> > > To: user@cassandra.apache.org
> > > Subject: [EXTERNAL] Regarding migrating data from Oracle to 
> > > Cassandra.migrate data from Oracle to Cassandra.
> > >
> > >
> > > > Hi all ,
> > > > Me new to Cassandra , i was asked to migrate data from Oracle to 
> > > > Cassandra.
> > > > Please help me giving your valuable guidance.
> > > > 1) Can it be done using open source Cassandra.
> > > > 2) Where should I start data model from?
> > > > 3) I should use java, what kind of  jar/libs/tools I need use ?
> > > > 4) How I decide the size of cluster , please provide some sample 
> > > > guidelines.
> > > > 5) this should be in production , so what kind of things i should take 
> > > > care for better support or debugging tomorrow?
> > > > 6) Please provide some good books /links which can help me in this task.
> > > >
> > > >
> > > > Thanks in advance.
> > > > Highly appreciated your every amal help.
> > > >
> > > > Regards,
> > > > Shyam
> > >
> > >


Re: [EXTERNAL] Regarding migrating data from Oracle to Cassandra.migrate data from Oracle to Cassandra.

2018-09-05 Thread Rahul Singh
The biggest issue you’ll have is that “migration” from a relational to 
Cassandra is not a 1 to 1. The schemas will have to change.

DSE has other technology that is a little more useful - such as Spark / Spark 
SQL / Solr that is built in which helps meet the needs which Oracle was 
previously providing.


Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 5, 2018, 10:47 AM -0500, Jeff Jirsa , wrote:
> All of  Sean's points are good, a few more:
> - Apache Cassandra (free, open source, official) is usually sufficient. DSE 
> may be faster, but really it's about whether or not you're willing to pay for 
> support. If you're trying to stop paying Oracle, I suspect you'd probably not 
> want to start paying someone else - try the free version first, and you can 
> look for proprietary options after that.
> - http://shop.oreilly.com/product/0636920043041.do is relatively recent and 
> mostly pretty good
> - Ask a lot of questions, use this list, but try things out first so people 
> have a way to point you in the right direction.
>
>
>
> > On Wed, Sep 5, 2018 at 7:58 AM Durity, Sean R  
> > wrote:
> > > 3 starting points:
> > > -  DO NOT migrate your tables as they are in Oracle to Cassandra. 
> > > In most cases, you need a different model for Cassandra
> > > -  DO take the (free) DataStax Academy courses to learn much more 
> > > about Cassandra as you dive in. It is a systematic and bite-size approach 
> > > to learning all things Cassandra (and eventually, DataStax Enterprise, 
> > > should you go that way). However, open source Cassandra is fine as a data 
> > > platform. DSE gives you more options for data models, better 
> > > administration and monitoring tools, support, etc. It all depends on what 
> > > you need/want to build/can afford
> > > -  Cluster sizing depends on your goals for the data platform. Do 
> > > you need lots of storage, lots of throughput, high availability, low 
> > > latency, workload separation, etc.? A couple guidelines – use at least 3 
> > > nodes per data center (DC) and at least 2 DCs for availability. Use SSDs 
> > > for storage and keep node size 3 TB or less for reasonable 
> > > administration.  If six nodes are too many – you probably don’t need 
> > > Cassandra. If you can define what you need your data platform to deliver, 
> > > then you can start a sizing discussion. The good thing is, you can always 
> > > scale (as long as the data model is good).
> > >
> > >
> > > Sean Durity
> > >
> > > From: sha p 
> > > Sent: Wednesday, September 05, 2018 9:21 AM
> > > To: user@cassandra.apache.org
> > > Subject: [EXTERNAL] Regarding migrating data from Oracle to 
> > > Cassandra.migrate data from Oracle to Cassandra.
> > >
> > >
> > > > Hi all ,
> > > > Me new to Cassandra , i was asked to migrate data from Oracle to 
> > > > Cassandra.
> > > > Please help me giving your valuable guidance.
> > > > 1) Can it be done using open source Cassandra.
> > > > 2) Where should I start data model from?
> > > > 3) I should use java, what kind of  jar/libs/tools I need use ?
> > > > 4) How I decide the size of cluster , please provide some sample 
> > > > guidelines.
> > > > 5) this should be in production , so what kind of things i should take 
> > > > care for better support or debugging tomorrow?
> > > > 6) Please provide some good books /links which can help me in this task.
> > > >
> > > >
> > > > Thanks in advance.
> > > > Highly appreciated your every amal help.
> > > >
> > > > Regards,
> > > > Shyam
> > >
> > >


Re: Datastax encryption with kms

2018-09-04 Thread Rahul Singh
This is a Cassandra user group — consider joining the Datastax Academy Slack 
group and asking there.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Sep 4, 2018, 1:52 PM -0500, Rahul Reddy , wrote:
> Hello,
>
> Has anyone done the dse cassandra stable/ commitlog tde encryption saving the 
> keys in kms or vault instead of kmip. If it's possible please do let me know


Re: A blog about Cassandra in the IoT arena

2018-08-29 Thread Rahul Singh
Understood. Deep problems to consider.

Partition size.
I’ve been looking at how YugaByte is using “tablets” of data. It’s an 
interesting proposition... it all comes down to the token-based addressing, 
which is optimized as a single-dimension array, and I think this is part of 
the limitation.


The sorting problem is one of the oldest in the industry. Maybe we need to look 
at Kafka and Lucene. Between the two, there are some interesting patterns to 
reference the location of data and to store those references. The compaction 
process wouldn’t need to “sort” if there were an optimized index which orders 
the vectors and the locations. Compacting files should be a “dumb” operation 
if the “smart” index is ready ahead of the task. The major reason Cassandra is 
fast is the partitioner, which effectively “indexes” the data into a node and 
into a token. We need to go one level deeper. Maybe it’s another compaction 
strategy that evenly distributes data by either a size threshold or by 
maintaining a certain number of SSTables.

Don’t have any ideas yet on anything better than Merkle trees. Will get back to 
you with ideas or code.

Good stuff.

Rahul
On Aug 24, 2018, 12:06 PM -0400, DuyHai Doan , wrote:
> No what I meant by infinite partition is not auto sub-partitioning, even at 
> server-side. Ideally Cassandra should be able to support infinite partition 
> size and make compaction, repair and streaming of such partitions manageable:
>
> - compaction: find a way to iterate super efficiently through the whole 
> partition and merge-sort all sstables containing data of the same partition.
>
>  - repair: find another approach than Merkle tree because its resolution is 
> not granular enough. Ideally repair resolution should be at the clustering 
> level or every xxx clustering values
>
>  - streaming: same idea as repair, in case of error/disconnection the stream 
> should be resumed at the latest clustering level checkpoint, or at least 
> should we checkpoint every xxx clustering values
>
>  - partition index: find a way to index efficiently the huge partition. Right 
> now huge partition has a dramatic impact on partition index. The work of 
> Michael Kjellman on birch indices is going into the right direction 
> (CASSANDRA-9754)
>
> About tombstone, there is recently a research paper about Dotted DB and an 
> attempt to make delete without using tombstones: 
> http://haslab.uminho.pt/tome/files/dotteddb_srds.pdf
>
>
>
> > On Fri, Aug 24, 2018 at 12:38 AM, Rahul Singh 
> >  wrote:
> > > Agreed. One of the ideas I had on partition size is to automatically 
> > > synthetically shard based on some basic patterns seen in the data.
> > >
> > > It could be implemented as a tool that would create a new table with an 
> > > additional part of the key that is an automatic created shard, or it 
> > > would use an existing key and then migrate the data.
> > >
> > > The internal automatic shard would adjust as needed and keep 
> > > “Subpartitons” or “rowsets” but return the full partition given some 
> > > special CQL
> > >
> > > This is done today at the Data Access layer and he data model design but 
> > > it’s pretty much a step by step process that could be algorithmically 
> > > done.
> > >
> > > Regarding the tombstone — maybe we have another thread dedicated to 
> > > cleaning tombstones - separate from compaction. Depending on the amount 
> > > of tombstones and a threshold, it would be dedicated to deletion. It may 
> > > be an edge case , but people face issues with tombstones all the time 
> > > because they don’t know better.
> > >
> > > Rahul
> > > On Aug 23, 2018, 11:50 AM -0500, DuyHai Doan , 
> > > wrote:
> > > > As I used to tell some people, the day we make :
> > > >
> > > > 1. partition size unlimited, or at least huge partition easily 
> > > > manageable (compaction, repair, streaming, partition index file)
> > > > 2. tombstone a non-issue
> > > >
> > > > that day, Cassandra will dominate any other IoT technology out there
> > > >
> > > > Until then ...
> > > >
> > > > > On Thu, Aug 23, 2018 at 4:54 PM, Rahul Singh 
> > > > >  wrote:
> > > > > > Good analysis of how the different key structures affect use cases 
> > > > > > and performance. I think you could extend this article with 
> > > > > > potential evaluation of FiloDB which specifically tries to solve 
> > > > > > the OLAP issue with arbitrary queries.
> > > > > >
> >

RE: [EXTERNAL] Re: Re: bigger data density with Cassandra 4.0?

2018-08-29 Thread Rahul Singh
YugaByte is also another new dancer in the Cassandra dance. The data store is 
based on RocksDB, and it’s written in C++. Although they are wire-compliant 
with C*, I’m pretty sure everything under the hood is NOT a port like Scylla 
was initially.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 29, 2018, 10:05 AM -0400, Durity, Sean R , 
wrote:
> If you are going to compare vs commercial offerings like Scylla and CosmosDB, 
> you should be looking at DataStax Enterprise. They are moving more quickly 
> than open source (IMO) on adding features and tools that enterprises really 
> need. I think they have some emerging tech for large/dense nodes, in 
> particular. The ability to handle different data model types (Graph and 
> Search) and embedded analytics sets it apart from plain Cassandra. Plus, they 
> have replaced Cassandra’s SEDA architecture to give it a significant boost in 
> performance. As a customer, I see the value in what they are doing.
>
>
> Sean Durity
> From: onmstester onmstester 
> Sent: Wednesday, August 29, 2018 7:43 AM
> To: user 
> Subject: [EXTERNAL] Re: Re: bigger data density with Cassandra 4.0?
>
> Could you please explain more about this (do you mean slower performance 
> compared to Cassandra?):
> ---Hbase tends to be quite average for transactional data
>
> and about:
> ScyllaDB IDK, I'd assume they just sorted out streaming by learning from 
> C*'s mistakes.
> While ScyllaDB is a much younger project than Cassandra, with much less 
> usage and attention, I currently face a dilemma when launching new clusters: 
> should I wait for the Cassandra community to apply all the enhancements and 
> bug fixes applied by its main competitors (ScyllaDB or Cosmos DB), or just 
> switch to a competitor (afraid of the new world!)?
> For example, right now is there a motivation to handle more dense nodes in 
> the near future?
>
> Again, Thank you for your time
>
> Sent using Zoho Mail
>
>
>  On Wed, 29 Aug 2018 15:16:40 +0430 kurt greaves  
> wrote 
>
> > Most of the issues around big nodes is related to streaming, which is 
> > currently quite slow (should be a bit better in 4.0). HBase is built on top 
> > of hadoop, which is much better at large files/very dense nodes, and tends 
> > to be quite average for transactional data. ScyllaDB IDK, I'd assume they 
> > just sorted out streaming by learning from C*'s mistakes.
> >
> > On 29 August 2018 at 19:43, onmstester onmstester  
> > wrote:
> >
> > >
> > > Thanks Kurt,
> > > Actually my cluster has > 10 nodes, so there is a tiny chance to stream a 
> > > complete SSTable.
> > > While logically any Columnar noSql db like Cassandra, needs always to 
> > > re-sort grouped data for later-fast-reads and having nodes with big 
> > > amount of data (> 2 TB) would be annoying for this background process, 
> > > How is it possible that some of these databases like HBase and Scylla db 
> > > does not emphasis on small nodes (like Cassandra do)?
> > >
> > > Sent using Zoho Mail
> > >
> > >
> > >  Forwarded message 
> > > From : kurt greaves 
> > > To : "User"
> > > Date : Wed, 29 Aug 2018 12:03:47 +0430
> > > Subject : Re: bigger data density with Cassandra 4.0?
> > >  Forwarded message 
> > >
> > > > My reasoning was if you have a small cluster with vnodes you're more 
> > > > likely to have enough overlap between nodes that whole SSTables will be 
> > > > streamed on major ops. As  N gets >RF you'll have less common ranges 
> > > > and thus less likely to be streaming complete SSTables. Correct me if 
> > > > I've misunderstood.
> > >
>
>
>
>


Re: Tombstone experience

2018-08-24 Thread Rahul Singh
Thanks! Great tips on clearing tombstones. The TTL vs. business-rules challenge 
is one we’ve seen in enterprises moving from relational to non-relational, 
because no thought is given to planning a data retention policy.

Periodic business-rules-based cleaning via Spark works well if you use it to 
set a short TTL on data you would otherwise have deleted; that will eventually 
clear out the data, depending on the value you set. My suggestion for those 
cases where you must do business-rules deletions: use a continuous Spark job / 
Spark Streaming on another DC to maintain data hygiene.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 24, 2018, 1:46 AM -0400, Charulata Sharma (charshar) 
, wrote:
> Hi All,
>
>    I have shared my experience of tombstone clearing in this blog post.
> Sharing it in this forum for wider distribution.
>
> https://medium.com/cassandra-tombstones-clearing-use-case/the-curios-case-of-tombstones-d897f681a378
>
>
> Thanks,
> Charu


Re: How to rename the column name in Cassandra tables

2018-08-23 Thread Rahul Singh
Which documentation are you referring to? Which version of Cassandra?

https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlAlterTable.html

Renaming a column
The main purpose of RENAME is to change the names of CQL-generated primary key 
and column names that are missing from a legacy table. The following 
restrictions apply to the RENAME operation:
• You can only rename clustering columns, which are part of the primary key.
• You cannot rename the partition key.
• You can index a renamed column.
• You cannot rename a column if an index has been created on it.
• You cannot rename a static column, since you cannot use a static column in 
the table's primary key.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 13, 2018, 7:42 AM -0500, Irtiza Ali , wrote:
> Hello everyone,
>
> Issue
> Currently, we are facing an issue of renaming the Cassandra table's column 
> name. According to the documentation, one can change the name of only those 
> columns that are part of primary or clustering columns(keys).
>
>
> Question
> Is there any way to rename the name of non-primary or clustering 
> columns(keys)?
>
> Thank you
>
> IA
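
A short CQL illustration of the restrictions quoted above (table and column
names are hypothetical): a clustering column can be renamed, while the usual
workaround for a regular column is to add a new column and backfill it.

```
-- allowed: ts is a clustering column
ALTER TABLE ks.events RENAME ts TO event_time;

-- a regular column cannot be renamed; instead, add a new column and
-- backfill it from the old one in application code
ALTER TABLE ks.events ADD payload_v2 text;
```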


Re: Fwd: Removing Extra Spaces and Row counts while using Capture Command

2018-08-23 Thread Rahul Singh
What’s your goal? Just output the results and save as JSON?


There may be a better way to do what you want.

https://github.com/tenmax/cqlkit/blob/master/README.md


Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 13, 2018, 9:14 PM -0500, kumar bharath , 
wrote:
> >
> > Hi All,
> >
> > I am using the cqlsh CAPTURE command to perform a select query operation to 
> > write data from a column family into a JSON-format file for further 
> > processing. I am able to do that successfully, but I am seeing extra 
> > spaces and row-count values after every few records.
> >
> > Please suggest a way to get rid of these unusual extra spaces and row-count 
> > values.
> >
> > Regards,
> > Bharath Kumar B
>


Re: Cassandra 2.2.7 Compaction after Truncate issue

2018-08-23 Thread Rahul Singh
David,

What CL do you set when running this command?

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 14, 2018, 11:49 AM -0500, David Payne , wrote:
> Scenario: Cassandra 2.2.7, 3 nodes, RF=3 keyspace.
>
> Truncate a table.
> More than 24 hours later… FileCacheService is still reporting cold readers 
> for sstables of truncated data for node 2 and 3, but not node 1.
> The output of nodeool compactionstats shows stuck compaction for the 
> truncated table for node 2 and 3, but not node 1.
>
> This appears to be a defect that was fixed in 2.1.0. 
> https://issues.apache.org/jira/browse/CASSANDRA-7803
>
> Any ideas?
>
> Thanks,
> David Payne
>     | ̄ ̄|
> _☆☆☆_
> ( ´_⊃`)
> c. 303-717-0548
> dav...@cqg.com
>


Re: 90million reads

2018-08-23 Thread Rahul Singh
Agreed. If your data model is good and there are no major read latencies due 
to data skew, wide partitions, or tombstones, you can literally scale linearly.

You could also consider having a plan in which you ramp up as the traffic 
increases.

Rahul Singh
Chief Executive Officer
m 202.905.2818

Anant Corporation
1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

We build and manage digital business technology platforms.
On Aug 14, 2018, 6:31 PM -0500, kurt greaves , wrote:
> Not a great idea to make config changes without testing. For a lot of changes, 
> however, you can make the change on one node and measure if there is an 
> improvement.
>
> You'd probably be best to add nodes (double should be sufficient), do tuning 
> and testing afterwards, and then decommission a few nodes if you can.
>
> > On Wed., 15 Aug. 2018, 05:00 Abdul Patel,  wrote:
> > > Currently our Cassandra prod is an 18-node, 3-DC cluster and the 
> > > application does 55 million reads per day; we want to add load and make 
> > > it 90 million reads per day. They need a guesstimate of resources which 
> > > we need to bump without testing. Off the top of my head we can increase 
> > > the heap and the native transport value. Any other parameters I should 
> > > be concerned about?


Re: A blog about Cassandra in the IoT arena

2018-08-23 Thread Rahul Singh
Agreed. One of the ideas I had on partition size is to automatically 
synthetically shard based on some basic patterns seen in the data.

It could be implemented as a tool that would create a new table with an 
additional part of the key that is an automatic created shard, or it would use 
an existing key and then migrate the data.

The internal automatic shard would adjust as needed and keep “subpartitions” 
or “rowsets”, but return the full partition given some special CQL.

This is done today at the data access layer and in the data model design, but 
it’s pretty much a step-by-step process that could be done algorithmically.

Regarding the tombstone — maybe we have another thread dedicated to cleaning 
tombstones - separate from compaction. Depending on the amount of tombstones 
and a threshold, it would be dedicated to deletion. It may be an edge case , 
but people face issues with tombstones all the time because they don’t know 
better.

Rahul
On Aug 23, 2018, 11:50 AM -0500, DuyHai Doan , wrote:
> As I used to tell some people, the day we make :
>
> 1. partition size unlimited, or at least huge partition easily manageable 
> (compaction, repair, streaming, partition index file)
> 2. tombstone a non-issue
>
> that day, Cassandra will dominate any other IoT technology out there
>
> Until then ...
>
> > On Thu, Aug 23, 2018 at 4:54 PM, Rahul Singh  
> > wrote:
> > > Good analysis of how the different key structures affect use cases and 
> > > performance. I think you could extend this article with potential 
> > > evaluation of FiloDB which specifically tries to solve the OLAP issue 
> > > with arbitrary queries.
> > >
> > > Another option is leveraging Elassandra (index in Elasticsearch 
> > > collocates with C*) or DataStax (index in Solr collocated with C*)
> > >
> > > I personally haven’t used SnappyData but that’s another Spark based DB 
> > > that could be leveraged for performance real-time queries on the OLTP 
> > > side.
> > >
> > > Rahul
> > > On Aug 23, 2018, 2:48 AM -0500, Affan Syed , wrote:
> > > > Hi,
> > > >
> > > > we wrote a blog about some of the results that engineers from AN10 
> > > > shared earlier.
> > > >
> > > > I am sharing it here for greater comments and discussions.
> > > >
> > > > http://www.an10.io/technology/cassandra-and-iot-queries-are-they-a-good-match/
> > > >
> > > >
> > > > Thank you.
> > > >
> > > >
> > > >
> > > > - Affan
>
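
Until something like that exists in the database itself, the manual version of
the synthetic shard Rahul describes looks roughly like this in CQL (names
hypothetical): the client computes the shard on write and fans reads out over
all shards, merging the results.

```
CREATE TABLE iot.readings_by_day (
    day       date,
    shard     smallint,   -- e.g. hash(sensor_id) % 16, computed client-side
    ts        timestamp,
    sensor_id text,
    value     double,
    PRIMARY KEY ((day, shard), ts, sensor_id)
);

-- a "full partition" read becomes 16 queries merged client-side
SELECT * FROM iot.readings_by_day WHERE day = '2018-08-23' AND shard = 0;
```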


Re: A blog about Cassandra in the IoT arena

2018-08-23 Thread Rahul Singh
Good analysis of how the different key structures affect use cases and 
performance. I think you could extend this article with potential evaluation of 
FiloDB which specifically tries to solve the OLAP issue with arbitrary queries.

Another option is leveraging Elassandra (index in Elasticsearch collocates with 
C*) or DataStax (index in Solr collocated with C*)

I personally haven’t used SnappyData but that’s another Spark based DB that 
could be leveraged for performance real-time queries on the OLTP side.

Rahul
On Aug 23, 2018, 2:48 AM -0500, Affan Syed , wrote:
> Hi,
>
> we wrote a blog about some of the results that engineers from AN10 shared 
> earlier.
>
> I am sharing it here for greater comments and discussions.
>
> http://www.an10.io/technology/cassandra-and-iot-queries-are-they-a-good-match/
>
>
> Thank you.
>
>
>
> - Affan


Re: Work in Progress - Awesome Cassandra Resources w/ Outline

2018-08-22 Thread Rahul Singh
Horia,

Thanks! I added the links. I can look into contributing this to the blog. I 
mainly curate this list to eventually be a definitive guide to all 
Cassandra-related things, which would include DataStax, Scylla, YugaByte, 
Cosmos, etc., which may or may not be related to Apache Cassandra per se.


Thanks for the suggestion!

Rahul
On Aug 9, 2018, 3:55 AM -0500, Horia Mocioi , wrote:
> Hello Rahul,
>
> Great compilation of resources.
>
> Maybe add this one to the Blogs category? https://lostechies.com/ryansvihla/tags
>
> This one is also quite good, I would say: https://academy.datastax.com/support-blog/deeper-dive-diagnosing-dse-performance-issues-ttop-and-multidump
>
> And since now there is an official blog, wouldn't it be good to have these
> resources there?
>
> Regards,
> Horia
>
> On ons, 2018-08-08 at 07:14 -0400, Rahul Singh wrote:
> > Folks,
> >
> > I've cleaned up the awesome-cassandra README which I've been working
> > on and published it as a github page. The goal is to make an
> > authoritative list of resources that sourced from the community.
> >
> > https://anant.github.io/awesome-cassandra
> >
> > TLDR:
> > 1. Work to be done: organizing more posts into the current outline so
> > that it's a logical organization of subject areas, e.g. all the blog
> > posts related to sstable management vs. material related to hints.
> >
> > 2. Looking for:  sources specially from this community in the form of
> > existing blogs w/ multiple posts whether from individuals or
> > companies - so either make a pull request or just submit to me
> > directly - via email or issue.
> >
> > Thanks,
> >
> > https://anant.github.io/awesome-cassandra
> >
> >
> >
> > I'm still working on the searchable index w/ facets ... but that's
> > also a parallel work in progress.
> >
> >
> >
> > Make it a great week,
> > Rahul 


Re: Repair daily refreshed table

2018-08-19 Thread Rahul Singh
If you wanted to be certain that all replicas were acknowledging receipt of the 
data, then you could use ALL or EACH_QUORUM (if you have multiple DCs), but you 
must really want high consistency if you do that.

You should avoid consciously creating tombstones if possible — it ends up 
making reads slower because they need to be accounted for until they are 
compacted / garbage collected out.

Tombstones are created when data is either deleted or nulled. When marking 
data with a TTL, the actual delete is not done until after the TTL has expired.

When you say you are overwriting, are you deleting and then loading? That’s the 
only way you should see tombstones — or maybe you are setting nulls?

Rahul
On Aug 18, 2018, 11:16 PM -0700, Maxim Parkachov , wrote:
> Hi Rahul,
>
> I'm already using LOCAL_QUORUM in the batch process and it runs every day. As 
> far as I understand, because I'm overwriting the whole table with a new TTL, 
> the process creates tons of tombstones, and I'm more concerned with them.
>
> Regards,
> Maxim.
>
> > On Sun, Aug 19, 2018 at 3:02 AM Rahul Singh  
> > wrote:
> > > Are you loading using a batch process? What’s the frequency of the data 
> > > Ingest and does it have to very fast. If not too frequent and can be a 
> > > little slower, you may consider a higher consistency to ensure data is on 
> > > replicas.
> > >
> > > Rahul
> > > On Aug 18, 2018, 2:29 AM -0700, Maxim Parkachov , 
> > > wrote:
> > > > Hi community,
> > > >
> > > > I'm currently puzzled with following challenge. I have a CF with 7 days 
> > > > TTL on all rows. Daily there is a process which loads actual data with 
> > > > +7 days TTL. Thus records which are not present in last 7 days of load 
> > > > expired. Amount of these expired records are very small < 1%. I have 
> > > > daily repair process, which take considerable amount of time and 
> > > > resources, and snapshot after that. Obviously I'm concerned only with 
> > > > the last loaded data. Basically, my question: should I run repair 
> > > > before load, after load or maybe I don't need to repair such table at 
> > > > all ?
> > > >
> > > > Regards,
> > > > Maxim.
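
To make the tombstone point concrete, a small CQL sketch (table and values are
hypothetical): re-inserting a row with a fresh TTL simply supersedes the old
cells, while explicitly writing null is what creates cell tombstones, so
unknown columns should be omitted rather than nulled.

```
-- daily refresh: overwrite with a new 7-day TTL; rows absent from the
-- load simply expire when their old TTL runs out
INSERT INTO ks.daily_data (id, col1, col2)
VALUES ('k1', 'a', 'b') USING TTL 604800;

-- this, by contrast, writes a tombstone for col2:
INSERT INTO ks.daily_data (id, col1, col2)
VALUES ('k1', 'a', null) USING TTL 604800;
```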


Re: Repair daily refreshed table

2018-08-18 Thread Rahul Singh
Are you loading using a batch process? What’s the frequency of the data ingest, 
and does it have to be very fast? If it's not too frequent and can be a little 
slower, you may consider a higher consistency to ensure data is on replicas.

Rahul
On Aug 18, 2018, 2:29 AM -0700, Maxim Parkachov , wrote:
> Hi community,
>
> I'm currently puzzled with following challenge. I have a CF with 7 days TTL 
> on all rows. Daily there is a process which loads actual data with +7 days 
> TTL. Thus records which are not present in last 7 days of load expired. 
> Amount of these expired records are very small < 1%. I have daily repair 
> process, which take considerable amount of time and resources, and snapshot 
> after that. Obviously I'm concerned only with the last loaded data. 
> Basically, my question: should I run repair before load, after load or maybe 
> I don't need to repair such table at all ?
>
> Regards,
> Maxim.


Re: ETL options from Hive/Presto/s3 to cassandra

2018-08-07 Thread Rahul Singh
Spark is scalable to as many nodes as you want and could be collocated with the 
data nodes; sstableloader won’t be as performant for larger datasets. Although 
it can be run in parallel on different nodes, I don’t believe it to be as fault 
tolerant.

If you have to do it continuously I would even think about leveraging Kafka as 
the transport layer and using Kafka Connect. It brings other tooling to get 
data into Cassandra from a variety of sources.

Rahul
On Aug 6, 2018, 3:16 PM -0400, srimugunthan dhandapani 
, wrote:
> Hi all,
> We have data that gets filled into Hive/ presto  every few hours.
> We want that data to be transferred to cassandra tables.
> What are some of the high performance ETL options for transferring data 
> between hive  or presto into cassandra?
>
> Also does anybody have any performance numbers comparing
> - loading data from S3 to cassandra using SStableloader
> - and loading data from S3 to cassandra using other means (like spark-api)?
>
> Thanks,
> mugunthan


Re: Hinted Handoff

2018-08-07 Thread Rahul Singh
What is the data size that you are talking about? What is your compaction 
strategy?

I wouldn’t recommend having such an aggressive TTL. Why not put a clustering 
key that allows you to get the data fairly quickly but have a longer TTL?



Cassandra can still be used if there is a legitimate need for multi-DC global 
replication and redundancy, which is not quite available at the same level of 
uptime in distributed caches like Redis.


Rahul
On Aug 7, 2018, 1:19 AM -0400, kurt greaves , wrote:
> > Does Cassandra TTL out the hints after max_hint_window_in_ms? From my 
> > understanding, Cassandra only stops collecting hints after 
> > max_hint_window_in_ms but can still keep replaying the hints if the node 
> > comes back again. Is this correct? Is there a way to TTL out hints?
>
> No, but it won't send hints that have passed HH window. Also, this shouldn't 
> be caused by HH as the hints maintain the original timestamp with which they 
> were written.
>
> Honestly, this sounds more like a use case for a distributed cache rather 
> than Cassandra. Keeping data for 30 minutes and then deleting it is going to 
> be a nightmare to manage in Cassandra.
>
> > On 7 August 2018 at 07:20, Agrawal, Pratik  
> > wrote:
> > > Does Cassandra TTL out the hints after max_hint_window_in_ms? From my 
> > > understanding, Cassandra only stops collecting hints after 
> > > max_hint_window_in_ms but can still keep replaying the hints if the node 
> > > comes back again. Is this correct? Is there a way to TTL out hints?
> > >
> > > Thanks,
> > > Pratik
> > >
> > > From: Kyrylo Lebediev 
> > > Reply-To: "user@cassandra.apache.org" 
> > > Date: Monday, August 6, 2018 at 4:10 PM
> > > To: "user@cassandra.apache.org" 
> > > Subject: Re: Hinted Handoff
> > >
> > > Small gc_grace_seconds value lowers max allowed node downtime, which is 
> > > 15 minutes in your case. After 15 minutes of downtime you'll need to 
> > > replace the node, as you described. This interval looks too short to be 
> > > able to do planned maintenance. So, in case you set larger value for 
> > > gc_grace_seconds (lets say, hours or a day) will you get visible read 
> > > amplification / waste a lot of disk space / issues with compactions?
> > >
> > > Hinted handoff may be the reason in case hinted handoff window is longer 
> > > than gc_grace_seconds. To me it looks like hinted handoff window 
> > > (max_hint_window_in_ms in cassandra.yaml, which defaults to 3h) must 
> > > always be set to a value less than gc_grace_seconds.
> > >
> > > Regards,
> > > Kyrill
> > > From: Agrawal, Pratik 
> > > Sent: Monday, August 6, 2018 8:22:27 PM
> > > To: user@cassandra.apache.org
> > > Subject: Hinted Handoff
> > >
> > > Hello all,
> > > We use Cassandra in non-conventional way, where our data is short termed 
> > > (life cycle of about 20-30 minutes) where each record is updated ~5 times 
> > > and then deleted. We have GC grace of 15 minutes.
> > > We are seeing 2 problems
> > > 1.) A certain number of Cassandra nodes goes down and then we remove it 
> > > from the cluster using Cassandra removenode command and replace the dead 
> > > nodes with new nodes. While new nodes are joining in, we see more nodes 
> > > down (which are not actually down) but we see following errors in the log
> > > “Gossip not settled after 321 polls. Gossip Stage 
> > > active/pending/completed: 1/816/0”
> > >
> > > To fix the issue, I restarted the server and the nodes now appear to be 
> > > up and the problem is solved
> > >
> > > Can this problem be related to 
> > > https://issues.apache.org/jira/browse/CASSANDRA-6590 ?
> > >
> > > 2.) Meanwhile, after restarting the nodes mentioned above, we see that 
> > > some old deleted data is resurrected (because of short lifecycle of our 
> > > data). My guess at the moment is that these data is resurrected due to 
> > > hinted handoff. Interesting point to note here is that data keeps 
> > > resurrecting at periodic intervals (like an hour) and then finally stops. 
> > > Could this be caused by hinted handoff? if so is there any setting which 
> > > we can set to specify that “invalidate, hinted handoff data after 5-10 
> > > minutes”.
> > >
> > > Thanks,
> > > Pratik
>


Re: Huge daily outbound network traffic

2018-08-07 Thread Rahul Singh
Are you sure you don’t have an outside process that is doing an export, a Spark
job, or a non-AWS-managed backup process?

Is this network out from Cassandra itself, or measured at the network level?


Rahul
On Aug 7, 2018, 4:09 AM -0400, Behnam B.Marandi , wrote:
> Hi,
> I have a 3 node Cassandra cluster (version 3.11.1) on m4.xlarge EC2 instances 
> with separate EBS volumes for root (gp2), data (gp2) and commitlog (io1).
> I get daily outbound traffic at a certain time every day. As you can see in
> the attached screenshot, while my normal network load hardly meets 200MB,
> this outbound (orange) spikes up to 2GB while inbound (purple) is less than
> 800MB.
> There is no repair or backup process going on in that time window, so I am
> wondering where to look. Any idea?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org


Re: Data model storage optimization

2018-07-29 Thread Rahul Singh
How many rows on average per partition?

Let me get this straight: you are bifurcating your partitions on either email
or username, essentially potentially doubling the data, because you don’t have
a way to manage a central system of record for users?

I would do this: (my opinion)
Migrate to a single sign on System that uses one or the other. Map and migrate 
your data to use a singular record as “identity”.
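
A hedged sketch of what that singular identity record could look like (schema
is illustrative):

```
-- one row per person, keyed by a stable identity id;
-- username and email become attributes instead of competing partition keys
CREATE TABLE user_identity (
    identity_id uuid PRIMARY KEY,
    username text,
    email text
);

-- small lookup tables map either credential back to the identity
CREATE TABLE identity_by_username (
    username text PRIMARY KEY,
    identity_id uuid
);

CREATE TABLE identity_by_email (
    email text PRIMARY KEY,
    identity_id uuid
);
```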

I know that seems painful but I _hate_ perpetuating bad design because someone,
in the past, present, or future, chooses not to solve the problem but to get
around it.

This is not a storage optimization problem - it’s a data architecture problem.

Rahul
On Jul 28, 2018, 3:11 AM -0400, onmstester onmstester , 
wrote:
> The current data model described as table name: 
> ((partition_key),cluster_key),other_column1,other_column2,...
>
> user_by_name: ((time_bucket, username)),ts,request,email
> user_by_mail: ((time_bucket, email)),ts,request,username
>
> The reason that both keys (username, email) are repeated in both tables is that
> there may be different usernames with the same email, or different emails with
> the same username, and the queries for this data model are:
> 1. username = X
> 2. mail = Y
> 3. username = X and mail = Y (we query one of the tables and, because there is
> a small number of records in the result, we filter on the other column)
>
> This data model results in wasting lots of storage.
> I thought of using a UUID, hash code, or sequence to handle this, but I can't keep
> track of the old vs new records (the ones that already have a UUID).
> Any recommendations on optimizing the data model to save storage?
>
> Sent using Zoho Mail
>
>


Re: optimization to cassandra-env.sh

2018-07-29 Thread Rahul Singh
It depends on which GC you are using, but you can definitely manage GC - you
will always be stuck with the upper limit of memory, though.

I found the Hubspot gc visualizer and the associated blog post very helpful in 
the past.

https://github.com/HubSpot/gc_log_visualizer/blob/master/README.md


https://product.hubspot.com/blog/g1gc-fundamentals-lessons-from-taming-garbage-collection

Rahul
On Jul 26, 2018, 1:27 PM -0400, R1 J1 , wrote:
> Has anyone tried to optimize or change cassandra-env.sh in a server
> installation to make it use a larger heap size for garbage collection?
> Any ideas? We are having some OOM issues and are wondering if we have options
> other than increasing RAM for that node.
>
> Regards
>


Re: cassandro nodes restarts

2018-07-29 Thread Rahul Singh
Need to review Java GC, system, network, disk, memory, node, and table
statistics. A lot can be discerned from visually examining the charts, e.g. is
the node with the most local reads failing, or is it the one with the most
writes, or is it completely unrelated?

Since it’s a distributed system you need to review the data points together for 
all nodes. Data is the only way to see what’s going on. Either connect 
Prometheus / Grafana , get Datadog , New Relic, or something else to see the 
patterns across the cluster.

https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/

I assembled that list recently — I would even add that getting system logs into
ELK or Splunk could also show some patterns otherwise not detected by tailing
and grepping.

Rahul
On Jul 26, 2018, 10:20 AM -0400, R1 J1 , wrote:
> Thanks for your prompt replies. No, it is not the same node bouncing each time.
> When you say it is about to tip over: what can we do to stop that?
>
> Also about that error : you guys are correct: it is  a warning and might not 
> be contributing to the node bounce issue and it can be removed by changing 
> batch_size_warn_threshold_in_kb: 5
>
> R1J1
>
> > On Wed, Jul 25, 2018 at 10:32 PM, R1 J1  wrote:
> > > cassandro nodes restarts
> > >
> > >
> > >
> > > we see errors typically like these
> > >
> > >
> > > WARN  [Native-Transport-Requests-3] 2018-07-25 20:51:38,520 
> > > BatchStatement.java:301 - Batch for "keyspace.table"
> > >  is of size 19.386KiB, exceeding specified threshold of 5.000KiB by 
> > > 14.386KiB.
> > >
> > >
> > > Regards
> > > R1J1
>


Re: Cassandra crashes after loading data with sstableloader

2018-07-29 Thread Rahul Singh
What does the “hash” data look like?

Rahul
On Jul 24, 2018, 11:30 AM -0400, Arpan Khandelwal , wrote:
> I need to clone data from one keyspace to another keyspace.
> We do it by taking a snapshot of keyspace1 and restoring it in keyspace2 using
> sstableloader.
>
> Suppose we have the following table, with an index on the hash column. The
> table has around 10M rows.
> -
> CREATE TABLE message (
>  id     uuid,
>  messageid     uuid,
>  parentid     uuid,
>  label     text,
>  properties     map<text, text>,
>  text1     text,
>  text2     text,
>  text3     text,
>  category     text,
>  hash     text,
>  info     map<text, text>,
>  creationtimestamp     bigint,
>  lastupdatedtimestamp     bigint,
>  PRIMARY KEY ( (id) )
>  );
>
> CREATE  INDEX  ON message ( hash );
> -
> Cassandra crashes when I load data using sstableloader. The load happens
> correctly, but it seems that Cassandra crashes when it's trying to build the
> index on a table with a large amount of data.
>
> I have two questions.
> 1. Is there any better way to clone a keyspace?
> 2. How can I optimize sstableloader to load data and not crash Cassandra
> while building the index?
>
> Thanks
> Arpan


Work in Progress - Bringing it all together in one "Awesome Cassandra" README

2018-07-26 Thread Rahul Singh
Hope you all are are having an amazing week.

I recently updated https://github.com/Anant/awesome-cassandra/. I've been
working on this while organizing the initial repository of links for a
"Cassandra Hub" (Planet Cassandra 2.0), collecting links on Cassandra and
distributed computing (e.g. Kafka, Spark, Akka, Kubernetes, etc.).

I've got about ~120 or so resources organized in this Readme, and I have a 
queue of another 100 or so. Please feel free to send me any focused Cassandra 
blogs related to development, architecture, or devops.

Thanks,

Rahul Singh
Chief Executive Officer | Internet Architecture
m 202.905.2818 | http://anant.us

1010 Wisconsin Ave NW, Suite 250
Washington, D.C. 20007

To empower people through the Internet to create a better world.

How are we doing? Please take our survey.

This email and any attachments to it may be confidential and are intended 
solely for the use of the individual to whom it is addressed. Any views or 
opinions expressed are solely those of the author and do not necessarily 
represent those of Anant Corporation. If you are not the intended recipient of 
this email, you must neither take any action based upon its contents, nor copy 
or show it to anyone. Please contact the sender if you believe you have 
received this email in error.


Re: Infinite loop of single SSTable compactions

2018-07-26 Thread Rahul Singh
Few questions


What is your maximum compacted partition bytes across the cluster for this table?
What’s your TTL?
What does your data model look like — as in, what’s your PK?

Rahul
On Jul 25, 2018, 1:07 PM -0400, James Shaw , wrote:
> nodetool compactionstats  --- see which table is compacting
> nodetool cfstats keyspace_name.table_name  --- check partition size,
> tombstones
>
> go to the data file directories: look at the data file sizes, timestamps ---
> a compaction will write to a new temp file with _tmplink...,
>
> use sstablemetadata ...    look at the largest or oldest one first
>
> of course, other factors may matter, like disk space, etc.
> also check compaction_throughput_mb_per_sec in cassandra.yaml
>
> Hope it is helpful.
>
> Thanks,
>
> James
>
>
>
>
> > On Wed, Jul 25, 2018 at 4:18 AM, Martin Mačura  wrote:
> > > Hi,
> > > we have a table which is being compacted all the time, with no change in 
> > > size:
> > >
> > > Compaction History:
> > > compacted_at            bytes_in    bytes_out   rows_merged
> > > 2018-07-25T05:26:48.101 57248063878 57248063878 {1:11655}
> > > 2018-07-25T01:09:47.346 57248063878 57248063878 {1:11655}
> > > 2018-07-24T20:52:48.652 57248063878 57248063878 {1:11655}
> > > 2018-07-24T16:36:01.828 57248063878 57248063878 {1:11655}
> > > 2018-07-24T12:11:00.026 57248063878 57248063878 {1:11655}
> > > 2018-07-24T07:28:04.686 57248063878 57248063878 {1:11655}
> > > 2018-07-24T02:47:15.290 57248063878 57248063878 {1:11655}
> > > 2018-07-23T22:06:17.410 57248137921 57248063878 {1:11655}
> > >
> > > We tried setting unchecked_tombstone_compaction to false, had no effect.
> > >
> > > The data is a time series, there will be only a handful of cell
> > > tombstones present. The table has a TTL, but it'll be least a month
> > > before it takes effect.
> > >
> > > Table properties:
> > >    AND compaction = {'class':
> > > 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> > > 'compaction_window_size': '1', 'compaction_window_unit': 'DAYS',
> > > 'max_threshold': '32', 'min_threshold': '4',
> > > 'unchecked_tombstone_compaction': 'false'}
> > >    AND compression = {'chunk_length_in_kb': '64', 'class':
> > > 'org.apache.cassandra.io.compress.LZ4Compressor'}
> > >    AND crc_check_chance = 1.0
> > >    AND dclocal_read_repair_chance = 0.0
> > >    AND default_time_to_live = 63072000
> > >    AND gc_grace_seconds = 10800
> > >    AND max_index_interval = 2048
> > >    AND memtable_flush_period_in_ms = 0
> > >    AND min_index_interval = 128
> > >    AND read_repair_chance = 0.0
> > >    AND speculative_retry = 'NONE';
> > >
> > > Thanks for any help
> > >
> > >
> > > Martin
> > >
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > >
>


Re: cassandro nodes restarts

2018-07-26 Thread Rahul Singh
Do the same nodes reboot or is it arbitrary? I’m wondering if it’s an isolated
incident related to data / traffic skew or could happen on any coordinator.

Rahul
On Jul 26, 2018, 12:31 AM -0400, Jeff Jirsa , wrote:
> It’s a warning, but probably not causing you problems
>
> A 20kB batch is a hint that your batches are larger than Cassandra expects, 
> but the 5k limit for that logger was somewhat arbitrary, and I would be 
> shocked if 20kB batches were a problem unless you were already close to 
> tipping your cluster
>
> If I were you I’d disable that warning (or set it much higher).
>
> --
> Jeff Jirsa
>
>
> > On Jul 25, 2018, at 7:32 PM, R1 J1  wrote:
> >
> > cassandro nodes restarts
> >
> >
> >
> > we see errors typically like these
> >
> >
> > WARN [Native-Transport-Requests-3] 2018-07-25 20:51:38,520 
> > BatchStatement.java:301 - Batch for "keyspace.table"
> > is of size 19.386KiB, exceeding specified threshold of 5.000KiB by 
> > 14.386KiB.
> >
> >
> > Regards
> > R1J1
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>


Re: apache cassandra development process and future

2018-07-18 Thread Rahul Singh
YugaByte!!! <— another Cassandra “compliant” DB - not sure if they forked C*
or wrote Cassandra in Go. ;)
https://github.com/YugaByte/yugabyte-db

Datastax is Cassandra compliant — and can use the same sstables at least until 
6.0 (which uses a patched version of  “4.0” which is 2-5x faster) — and has the 
same actual tools that are in the OS version.

Here are some signals from the big players that are understanding it’s power 
and need.

1. Azure CosmosDB has a C* compliant API - seems like Managed C* under the 
hood. They used ElasticSearch to run their Azure Search …
2. Oracle now has a Datastax offering
3. Mesosphere offers supported versions of Cassandra and Datastax
4. Kubernetes and related purveyors use Cassandra as prime example as a part of 
a Kubernetes backed cloud agnostic orchestration framework
5. What Alain mentioned earlier.


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 18, 2018, 9:35 AM -0400, Alain RODRIGUEZ , wrote:
> Hello,
>
> It's a complex topic that has already been extensively discussed (at least 
> for the part about Datastax). I am sharing my personal understanding, from 
> what I read in the mailing list mostly:
>
> > Recently Cassandra eco system became very fragmented
>
> I would not put Scylladb in the same 'eco system' as Apache Cassandra. I
> believe it is inspired by Cassandra and claims to be compatible with it up to
> a certain point, but it's not the same software, thus not the same users and
> community.
>
> About Datastax, I think they will give you a better idea of their position by 
> themselves here or through their support. I believe they also communicated 
> about it already. But in any case, I see Datastax more in the same 'eco 
> system' than Scylladb. Datastax uses a patched/forked version of Cassandra (+ 
> some other tools integrated with Cassandra and support). Plus it goes both 
> ways, Datastax greatly contributed to making Cassandra what it is now and 
> relies on it (or use to do so at least). I don't think that's the case for 
> Scylladb I don't see that much interest in connection/exchanges with 
> Scylladb, I mean no more than exchanging about DynamoDB for example. We can 
> make standards, compatibles features, compare performances, etc, but it's not 
> the same code base.
>
> > Since Datastax used to be the major participant in Cassandra
> > development and now it looks like it goes its own way, what is going to
> > happen with Apache Cassandra?
>
> Well, this is a fair point, that was discussed in the past, but to make it 
> short, Apache Cassandra is not dead or anything close. There is a lot of 
> activity. Some people are stepping out, other stepping in, and other 
> companies and individual are actively contributing to Cassandra. A version 
> 4.0 of Cassandra is being actively worked on at the moment. If these topics 
> are of interest, you might want to join the "Cassandra dev" mailing list 
> (http://cassandra.apache.org/community/).
>
> > Are there any other active participants in development?
>
> Yes, directly or by open sourcing internal tools quite a few companies have 
> contributed and continue to contribute to the Apache Cassandra ecosystem. I 
> invite you to have a look directly at this dev mailing list and check 
> people's email, profiles or companies. Check the Jira as well :). I am not 
> into doing this kind of stuff that much myself, I am not following this 
> closely but I can name for sure Apple, Netflix, The Last Pickle (my company), 
> Instaclustr I believe as well and many others that I am sorry not to name 
> here.
>
> Some people are working on Apache Cassandra for years and are around to help 
> regularly, they changed company but are still working on Cassandra, or even 
> changed company to work more with Apache Cassandra in some cases.
>
> > I'm also interested in which distribution is the most popular at the
> > moment in production?
>
> I would say now you should start with C*3.0.last or C* 3.11.last. It seems to 
> be the general consensus in the mailing list lately.
> For Scylladb and Datastax I don't know about the version to use. You should 
> ask them directly.
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> > 2018-07-18 12:39 GMT+01:00 Vitaliy Semochkin :
> > > Hi,
> > >
> > > Recently Cassandra eco system became very fragmented:
> > >
> > > Scylladb provides a solution based on the Cassandra wire protocol, claiming
> > > it is 10 times faster than Cassandra.
> > >
> > > Datastax provides its own solution called DSE claiming i

Re: Cassandra node RAM amount vs data-per-node/total data?

2018-07-17 Thread Rahul Singh
If you have a read-heavy cluster, even with LCS you can still optimize by 
having more of the key / row cache in memory.

If you have a write heavy / read heavy , then you need more memory so that more 
data is available in the memtable as its written.

Not having enough memory means not enough heap space, so unnecessary GC
pressure even with G1GC … which has STW pauses … eventually.

Non-responsiveness was generally due to GC pauses… (assuming the data model was
good all around).
On Jul 17, 2018, 10:39 AM -0400, Vsevolod Filaretov , 
wrote:
> @Rahul Singh thank you for the answer!
>
> What is your logic behind such RAM-per-node values? What symptoms usually 
> suggest you that you need more RAM?
>
> Did you ever get C* node soft lockups / non-responsiveness due to a node being
> loaded up to 100% of either RAM/CPU/IO? If yes - under which conditions?
>
> Thank you!
>
> Best regards,
> Vsevolod.
>
> > вт, 17 июл. 2018 г., 17:22 Rahul Singh :
> > > I usually don’t want to put more than 1.0-1.5 TB ( at the most ) per 
> > > node. It makes streaming slow beyond my patience and keeps the repair / 
> > > compaction processes lean. Memory depends on how much you plan to keep in 
> > > memory in terms of key / row cache. For my uses, no less than 64GB if not 
> > > more ~ 128GB. The lowest I’ve gone is 16GB but that’s for dev purposes 
> > > only.
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > > https://www.anant.us/datastax
> > >
> > > Anant Corporation
> > > On Jul 17, 2018, 8:26 AM -0400, Vsevolod Filaretov 
> > > , wrote:
> > > > What are general community and/or your personal experience viewpoints 
> > > > on cassandra node RAM amount vs data stored per node question?
> > > >
> > > > Thank you very much.
> > > >
> > > > Best regards,
> > > > Vsevolod.


Re: Cassandra Repair

2018-07-17 Thread Rahul Singh
I would recommend at least looking at reaper before trying to engineer another 
way.
On Jul 17, 2018, 12:56 PM -0400, rajasekhar kommineni , 
wrote:
nodetool tablestats has an attribute for Percent repaired; can we target the
tables based on the % given?
>
>
>
> > On Jul 17, 2018, at 4:45 AM, Rahul Singh  
> > wrote:
> >
> > Have you considered looking into reaper project — could save you time in 
> > figuring out your own strategy. 
> > https://github.com/thelastpickle/cassandra-reaper
> >
> > Otherwise you can always do a round robin of cron jobs per node once a 
> > week… Your repair cycle should repair all servers within a window less than 
> > your shortest GC grace seconds.
> >
> > So if you have a GC of 10 days, you want to complete your repairs in 9 days…
> >
> >
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> > On Jul 16, 2018, 5:15 PM -0400, rajasekhar kommineni , 
> > wrote:
> > > Hello All,
> > >
> > >
> > > I have all cluster nodes in Cloud, and there is very rare chance for 
> > > nodes going down. I want to prepare repair strategy to my cluster, so 
> > > need inputs on any calculations to decide when to go for repair.
> > >
> > > Also let me know if my statement is correct or not "It’s not only node 
> > > down time,but the write consistency level is also a factor for regular 
> > > repairs”.
> > >
> > >
> > > Thanks,
> > > -
> > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > >
>


Re: Cassandra node RAM amount vs data-per-node/total data?

2018-07-17 Thread Rahul Singh
I usually don’t want to put more than 1.0-1.5 TB (at the most) per node. It
makes streaming slow beyond my patience and keeps the repair / compaction 
processes lean. Memory depends on how much you plan to keep in memory in terms 
of key / row cache. For my uses, no less than 64GB if not more ~ 128GB. The 
lowest I’ve gone is 16GB but that’s for dev purposes only.

--
Rahul Singh
rahul.si...@anant.us
https://www.anant.us/datastax

Anant Corporation
On Jul 17, 2018, 8:26 AM -0400, Vsevolod Filaretov , 
wrote:
> What are general community and/or your personal experience viewpoints on 
> cassandra node RAM amount vs data stored per node question?
>
> Thank you very much.
>
> Best regards,
> Vsevolod.


RE: [EXTERNAL] New cluster vs Increasing nodes to already existed cluster

2018-07-17 Thread Rahul Singh
You can make new clusters, or you can isolate applications with datacenters to
which a keyspace is not replicated.
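
For example (keyspace and datacenter names are hypothetical), each
application's keyspace can be pinned to its own datacenter:

```
CREATE KEYSPACE app_one_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_app_one': 3};

CREATE KEYSPACE app_two_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_app_two': 3};
```
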
On Jul 16, 2018, 10:41 AM -0400, Durity, Sean R , 
wrote:
> In most cases, we separate clusters by application. This does help with 
> isolating problems. A bad query in one application won’t affect other 
> applications. Also, you can then scale each cluster as required by the data 
> demands. You can also upgrade separately, which may be a huge help. You only 
> need one team’s testing (and driver change or whatever) before you can 
> upgrade. With a multi-tenant ring, you will need much more coordination for 
> any changes.
>
> There is a practical limit of the number of memtables per cluster, too. This 
> is somewhere in the low hundreds (200-300), based on the amount of RAM you 
> have per node.
>
>
> Sean Durity
>
> From: onmstester onmstester 
> Sent: Monday, July 16, 2018 9:17 AM
> To: "user" 
> Subject: [EXTERNAL] New cluster vs Increasing nodes to already existed cluster
>
> Currently I have a cluster with 10 nodes dedicated to one keyspace (hardware
> sizing was done according to input rate and TTL just for current application
> requirements).
> I need to launch a new application with a new keyspace on another set of
> servers (8 nodes); there is no relation between the current and new
> application. I have two options:
> 1. add new nodes to the already existing cluster (10 nodes + 8 nodes) and share
> the power and storage between the keyspaces
> 2. create a new cluster for the new application (isolate clusters)
> Which option do you recommend and why? (I care about cost of maintenance,
> performance (write and read), and isolation of problems)
> Sent using Zoho Mail
>
>
>
>
> The information in this Internet Email is confidential and may be legally 
> privileged. It is intended solely for the addressee. Access to this Email by 
> anyone else is unauthorized. If you are not the intended recipient, any 
> disclosure, copying, distribution or any action taken or omitted to be taken 
> in reliance on it, is prohibited and may be unlawful. When addressed to our 
> clients any opinions or advice contained in this Email are subject to the 
> terms and conditions expressed in any applicable governing The Home Depot 
> terms of business or client engagement letter. The Home Depot disclaims all 
> responsibility and liability for the accuracy and content of this attachment 
> and for any damages or losses arising from any inaccuracies, errors, viruses, 
> e.g., worms, trojan horses, etc., or other items of a destructive nature, 
> which may be contained in this attachment and shall not be liable for direct, 
> indirect, consequential or special damages in connection with this e-mail 
> message or its attachment.


Re: Cassandra Repair

2018-07-17 Thread Rahul Singh
Have you considered looking into reaper project — could save you time in 
figuring out your own strategy. 
https://github.com/thelastpickle/cassandra-reaper

Otherwise you can always do a round robin of cron jobs per node once a week… 
Your repair cycle should repair all servers within a window less than your 
shortest GC grace seconds.

So if you have a GC of 10 days, you want to complete your repairs in 9 days…



--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 16, 2018, 5:15 PM -0400, rajasekhar kommineni , 
wrote:
> Hello All,
>
>
> I have all cluster nodes in the cloud, and there is a very rare chance of nodes
> going down. I want to prepare a repair strategy for my cluster, so I need inputs
> on any calculations to decide when to go for repair.
>
> Also let me know if my statement is correct or not: “It’s not only node down
> time, but the write consistency level is also a factor for regular repairs”.
>
>
> Thanks,
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>


Re: Bind keyspace to specific data directory

2018-07-17 Thread Rahul Singh
What’s the goal, Abdul? Is it for security reasons or for organizational
reasons? You could try prefixing / suffixing the keyspace names if it's for
organizational reasons (for now), if you don’t want to do the manual management
of mounts as Anthony suggested.
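
For instance, a quick sketch of that naming convention (names and replication
settings are made up):

```
-- group keyspaces by owning team or application with a name prefix
CREATE KEYSPACE team_a_orders
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

CREATE KEYSPACE team_b_billing
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};
```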

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 16, 2018, 11:00 PM -0400, Anthony Grasso , 
wrote:
> Hi Abdul,
>
> There is no mechanism offered in Cassandra to bind a keyspace (when created) 
> to specific filesystem or directory. If multiple filesystems or directories 
> are specified in the data_file_directories property in the cassandra.yaml 
> then Cassandra will attempt to evenly distribute data from all keyspaces 
> across them.
>
> Cassandra places table directories for each keyspace in a folder under the 
> path(s) specified in the data_file_directories property. That is, if the 
> data_file_directories property was set to /var/lib/cassandra/data and 
> keyspace "foo" was created, Cassandra would create the directory 
> /var/lib/cassandra/data/foo.
>
> One possible way to bind a keyspace to a particular file system is to create a
> custom mount point that has the same path as the keyspace. For example, if you
> had a particular volume that you wanted to use for keyspace "foo", you could 
> do something like:
>
> sudo mount <device> /var/lib/cassandra/data/foo
>
> Note that you would probably need to do this after the keyspace is created 
> and before the tables are created. This setup would mean that all 
> reads/writes for tables in keyspace "foo" would touch that volume.
>
> Regards,
> Anthony
>
> > On Tue, 3 Jul 2018 at 07:02, Abdul Patel  wrote:
> > > Hi
> > >
> > > Can we bind or specify, while creating a keyspace, a specific
> > > filesystem or directory for writing?
> > > I see we can split data on multiple filesystems, but can we decide which
> > > filesystem a particular keyspace can read and write?


Re: Cassandra recommended server uptime?

2018-07-17 Thread Rahul Singh
It’s likely that if you have server stability issues it's because of data model
or compaction strategy configurations which lead to out-of-memory issues or
massive GC pauses. Rebooting wouldn’t solve those issues.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 17, 2018, 7:28 AM -0400, Simon Fontana Oscarsson 
, wrote:
> Not anything that I'm aware of. Cassandra can run for months/years without
> rebooting.
> It is better to monitor your nodes and if you find anything abnormal a 
> restart can help.
>
> --
> SIMON FONTANA OSCARSSON
> Software Developer
>
> Ericsson
> Ölandsgatan 1
> 37133 Karlskrona, Sweden
> simon.fontana.oscars...@ericsson.com
> www.ericsson.com
>
> On tis, 2018-07-17 at 12:09 +0300, Vsevolod Filaretov wrote:
> > Good time of day everyone;
> >
> > Does Cassandra have a "recommended uptime"? I.e., do regular Cassandra
> > node reboots help anything? Are periodic node reboots recommended for
> > general system stability?
> >
> > Best regards, Vsevolod.


Clarification needed on how triggers execute on batch mutations

2018-07-12 Thread Rahul Singh
Folks,

I have a question regarding how mutations from batch statements trigger
'TRIGGERS'

In an unlogged batch with a single-partition mutation, I'm expecting one
partition to be affected and returned... but does the trigger fire for each and
every row? In a logged batch on a single partition, I'm expecting the same
as above.

In a logged batch with a multi-partition mutation, I'm expecting multiple
trigger invocations, one per mutated partition, regardless of replicas (as in a
single-partition mutation). In an unlogged batch with a multi-partition
mutation, I'm expecting the same; see the sketch below.
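
For concreteness, the two shapes I am asking about (keyspace, table, and values
are made up):

```
-- single-partition batch: only partition 'p1' is mutated
BEGIN UNLOGGED BATCH
    INSERT INTO ks.tbl (pk, ck, val) VALUES ('p1', 1, 'a');
    INSERT INTO ks.tbl (pk, ck, val) VALUES ('p1', 2, 'b');
APPLY BATCH;

-- multi-partition logged batch: partitions 'p1' and 'p2' are mutated
BEGIN BATCH
    INSERT INTO ks.tbl (pk, ck, val) VALUES ('p1', 1, 'a');
    INSERT INTO ks.tbl (pk, ck, val) VALUES ('p2', 1, 'c');
APPLY BATCH;
```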

Since the coordinator does the write management, I am expecting that
regardless of whether I'm doing a logged or unlogged batch, the trigger on
any given table will only be triggered once per mutated partition.

Is my assumption correct?

Rahul Singh

Chief Executive Officer | Internet Architecture

https://www.anant.us/datastax


m 202.905.2818 | View my profile <http://linkedin.com/in/xingh> | Team
Office Hours <http://links.anant.us/rs.office.hours> | Appointment Calendar
<https://calendly.com/xingh/>

1010 Wisconsin Ave NW, Suite 250

Washington, D.C. 20007

To empower people through the Internet to create a better world.

How are we doing? Please take our survey.
<https://anantcorp.wufoo.com/forms/zqo0ylp0h8igra/>

This email and any attachments to it may be confidential and are intended
solely for the use of the individual to whom it is addressed. Any views or
opinions expressed are solely those of the author and do not necessarily
represent those of Anant Corporation. If you are not the intended recipient
of this email, you must neither take any action based upon its contents,
nor copy or show it to anyone. Please contact the sender if you believe you
have received this email in error.


Re: Jmx_exporter CPU spike

2018-07-10 Thread Rahul Singh
Nice find, Ben. I added this to my list of c* monitoring tools.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 9, 2018, 8:20 PM -0500, rajpal reddy , wrote:
> Thanks, Ben! Will look into it.
> > On Jul 9, 2018, at 10:42 AM, Ben Bromhead  wrote:
> >
> > Hi Rajpal
> >
> > I'd invite you to have a look at 
> > https://github.com/zegelin/cassandra-exporter
> >
> > Significantly faster (bypasses JMX rpc stuff, 10ms to collect metrics for 
> > 300 tables vs 2-3 seconds via JMX), plus the naming/tagging fits far better 
> > into the Prometheus world. Still missing a few stats like GC etc, but feel 
> > free to submit a PR!
> >
> > Ben
> >
> >
> >
> > > On Mon, Jul 9, 2018 at 12:03 AM Rahul Singh 
> > >  wrote:
> > > > How often are you polling the JMX? How much of a spike are you seeing 
> > > > in CPU?
> > > >
> > > > --
> > > > Rahul Singh
> > > > rahul.si...@anant.us
> > > >
> > > > Anant Corporation
> > > > On Jul 5, 2018, 2:45 PM -0500, rajpal reddy , 
> > > > wrote:
> > > > >
> > > > > we have a Qualys security scan running causing the CPU spike. We are
> > > > > seeing the CPU spike only when JMX metrics are exposed using
> > > > > jmx_exporter. We tried setting up JMX authentication and still see the
> > > > > CPU spike; if I stop using jmx_exporter we don’t see any CPU spike. Is
> > > > > there anything we have to tune to make it work with jmx_exporter?
> > > > >
> > > > >
> > > > > -
> > > > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > > > >
> > --
> > Ben Bromhead
> > CTO | Instaclustr
> > +1 650 284 9692
> > Reliability at Scale
> > Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
>


Re: Installation

2018-07-10 Thread Rahul Singh
That approach will work, however that may take a long time.

The important things that are unique to your cluster will be your configuration
files and your data / log directories.

The binaries can be placed on the same machines via a tar installation. While
keeping the machines running on the old binaries, you can migrate the data /
logs to new directories. If you move your data, you can use symlinks in Linux
to point the old directories to the new locations.

Once this is done, you can configure your tar installation to point to your new 
data directories, and turn off the old binaries and turn on the new binaries, 
one node at a time.


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 9, 2018, 6:35 PM -0500, rajpal reddy , wrote:
> We have our infrastructure in the cloud, so we opted for adding a new DC with the
> tar.gz installation and then removed the old DC with the package installation.
>
> Sent from my iPhone
>
> > On Jul 9, 2018, at 2:23 PM, rajasekhar kommineni  
> > wrote:
> >
> > Hello All,
> >
> > I have a Cassandra cluster installed from packages; I want to
> > convert it to a tar.gz installation. Is there any procedure to follow?
> >
> > Thanks,
> > Rajasekhar Kommineni
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>


Re: Jmx_exporter CPU spike

2018-07-08 Thread Rahul Singh
How often are you polling the JMX? How much of a spike are you seeing in CPU?

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jul 5, 2018, 2:45 PM -0500, rajpal reddy , wrote:
>
> we have a Qualys security scan running causing the CPU spike. We are seeing the
> CPU spike only when JMX metrics are exposed using jmx_exporter. We tried setting
> up JMX authentication and still see the CPU spike; if I stop using jmx_exporter
> we don’t see any CPU spike. Is there anything we have to tune to make it work
> with jmx_exporter?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>


Re: Is there a plan for Feature like this in C* ?

2018-07-03 Thread Rahul Singh
Some of my links related to Kafka and Cassandra

http://leaves.anant.us/#!/leaf/10767?tag=cassandra,kafka


Rahul
On Jul 3, 2018, 11:48 AM -0400, Joshua Galbraith 
, wrote:
> There is more info and background context on CDC here:
> https://issues.apache.org/jira/browse/CASSANDRA-8844
>
> > On Mon, Jul 2, 2018 at 9:26 PM, Justin Cameron  
> > wrote:
> > > Sorry - you'd need a source connector, not the sink.
> > >
> > > > On Tue, 3 Jul 2018 at 04:24 Justin Cameron  
> > > > wrote:
> > > > > Yeah, if you're using Kafka Connect you could use the Cassandra sink 
> > > > > connector
> > > > >
> > > > > > On Tue, 3 Jul 2018 at 02:37 Jeff Jirsa  wrote:
> > > > > > > Its a stable API - the project doesn’t ship a Kafka connector but 
> > > > > > > certainly people have written them
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jeff Jirsa
> > > > > > >
> > > > > > >
> > > > > > > On Jul 2, 2018, at 6:50 PM, Kant Kodali  wrote:
> > > > > > >
> > > > > > > > Hi Justin,
> > > > > > > >
> > > > > > > > Thanks, Looks like a very early stage feature and no 
> > > > > > > > integration with Kafka yet I suppose.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > > On Mon, Jul 2, 2018 at 6:24 PM, Justin Cameron 
> > > > > > > > >  wrote:
> > > > > > > > > > yes, take a look at 
> > > > > > > > > > http://cassandra.apache.org/doc/latest/operating/cdc.html
> > > > > > > > > >
> > > > > > > > > > > On Tue, 3 Jul 2018 at 01:20 Kant Kodali 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > https://www.cockroachlabs.com/docs/v2.1/change-data-capture.html
> > > > > > > > > > --
> > > > > > > > > > Justin Cameron
> > > > > > > > > > Senior Software Engineer
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This email has been sent on behalf of Instaclustr Pty. 
> > > > > > > > > > Limited (Australia) and Instaclustr Inc (USA).
> > > > > > > > > >
> > > > > > > > > > This email and any attachments may contain confidential and 
> > > > > > > > > > legally privileged information.  If you are not the 
> > > > > > > > > > intended recipient, do not copy or disclose its content, 
> > > > > > > > > > but please reply to this email immediately and highlight 
> > > > > > > > > > the error to the sender and then immediately delete the 
> > > > > > > > > > message.
> > > > > > > >
> > > > > --
> > > > > Justin Cameron
> > > > > Senior Software Engineer
> > > > >
> > > > >
> > > > > This email has been sent on behalf of Instaclustr Pty. Limited 
> > > > > (Australia) and Instaclustr Inc (USA).
> > > > >
> > > > > This email and any attachments may contain confidential and legally 
> > > > > privileged information.  If you are not the intended recipient, do 
> > > > > not copy or disclose its content, but please reply to this email 
> > > > > immediately and highlight the error to the sender and then 
> > > > > immediately delete the message.
> > > --
> > > Justin Cameron
> > > Senior Software Engineer
> > >
> > >
> > > This email has been sent on behalf of Instaclustr Pty. Limited 
> > > (Australia) and Instaclustr Inc (USA).
> > >
> > > This email and any attachments may contain confidential and legally 
> > > privileged information.  If you are not the intended recipient, do not 
> > > copy or disclose its content, but please reply to this email immediately 
> > > and highlight the error to the sender and then immediately delete the 
> > > message.
>
>
>
> --
> Joshua Galbraith | Lead Software Engineer | New Relic


Re: Is there a plan for Feature like this in C* ?

2018-07-03 Thread Rahul Singh
There is a source connector from Landoop for Kafka Connect, but it is based on
polling a “kcql” select statement. They claim to be working on a CDC source
connector for Kafka Connect, but I couldn’t find anything.

Smart Cat Labs has a CDC trigger-based Kafka producer, but I don’t think it uses
Kafka Connect.

Theoretically you should be able to use the Smart Cat Labs CDC Kafka producer
and then use that with Kafka Connect to write elsewhere.

Rahul
On Jul 3, 2018, 11:48 AM -0400, Joshua Galbraith 
, wrote:
> There is more info and background context on CDC here:
> https://issues.apache.org/jira/browse/CASSANDRA-8844
>
> > On Mon, Jul 2, 2018 at 9:26 PM, Justin Cameron  
> > wrote:
> > > Sorry - you'd need a source connector, not the sink.
> > >
> > > > On Tue, 3 Jul 2018 at 04:24 Justin Cameron  
> > > > wrote:
> > > > > Yeah, if you're using Kafka Connect you could use the Cassandra sink 
> > > > > connector
> > > > >
> > > > > > On Tue, 3 Jul 2018 at 02:37 Jeff Jirsa  wrote:
> > > > > > > Its a stable API - the project doesn’t ship a Kafka connector but 
> > > > > > > certainly people have written them
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > > Jeff Jirsa
> > > > > > >
> > > > > > >
> > > > > > > On Jul 2, 2018, at 6:50 PM, Kant Kodali  wrote:
> > > > > > >
> > > > > > > > Hi Justin,
> > > > > > > >
> > > > > > > > Thanks, Looks like a very early stage feature and no 
> > > > > > > > integration with Kafka yet I suppose.
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > > > On Mon, Jul 2, 2018 at 6:24 PM, Justin Cameron 
> > > > > > > > >  wrote:
> > > > > > > > > > yes, take a look at 
> > > > > > > > > > http://cassandra.apache.org/doc/latest/operating/cdc.html
> > > > > > > > > >
> > > > > > > > > > > On Tue, 3 Jul 2018 at 01:20 Kant Kodali 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > > https://www.cockroachlabs.com/docs/v2.1/change-data-capture.html
> > > > > > > > > > --
> > > > > > > > > > Justin Cameron
> > > > > > > > > > Senior Software Engineer
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This email has been sent on behalf of Instaclustr Pty. 
> > > > > > > > > > Limited (Australia) and Instaclustr Inc (USA).
> > > > > > > > > >
> > > > > > > > > > This email and any attachments may contain confidential and 
> > > > > > > > > > legally privileged information.  If you are not the 
> > > > > > > > > > intended recipient, do not copy or disclose its content, 
> > > > > > > > > > but please reply to this email immediately and highlight 
> > > > > > > > > > the error to the sender and then immediately delete the 
> > > > > > > > > > message.
> > > > > > > >
> > > > > --
> > > > > Justin Cameron
> > > > > Senior Software Engineer
> > > > >
> > > > >
> > > > > This email has been sent on behalf of Instaclustr Pty. Limited 
> > > > > (Australia) and Instaclustr Inc (USA).
> > > > >
> > > > > This email and any attachments may contain confidential and legally 
> > > > > privileged information.  If you are not the intended recipient, do 
> > > > > not copy or disclose its content, but please reply to this email 
> > > > > immediately and highlight the error to the sender and then 
> > > > > immediately delete the message.
> > > --
> > > Justin Cameron
> > > Senior Software Engineer
> > >
> > >
> > > This email has been sent on behalf of Instaclustr Pty. Limited 
> > > (Australia) and Instaclustr Inc (USA).
> > >
> > > This email and any attachments may contain confidential and legally 
> > > privileged information.  If you are not the intended recipient, do not 
> > > copy or disclose its content, but please reply to this email immediately 
> > > and highlight the error to the sender and then immediately delete the 
> > > message.
>
>
>
> --
> Joshua Galbraith | Lead Software Engineer | New Relic


Resources for Monitoring Cassandra, Spark, Solr

2018-07-02 Thread Rahul Singh
Folks,

We often get questions on monitoring here so I assembled this post with 
articles from those in the community as well as links to the component tools to 
give folks a more comprehensive listing.

https://blog.anant.us/resources-for-monitoring-datastax-cassandra-spark-solr-performance/

This is a work in progress and I'll update this with screenshots as well as 
with links from other contributors.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation


Re: C* in multiple AWS AZ's

2018-06-29 Thread Rahul Singh
Totally agree. GPFS for the win. The EC2 multi-region snitch is a convenience
you can replace with an automation tool like Ansible or Puppet. Unless you have
two orders of magnitude more servers than you do now, you don’t need it.

Rahul
On Jun 29, 2018, 6:18 AM -0400, kurt greaves , wrote:
> Yes. You would just end up with a rack named differently to the AZ. This is 
> not a problem as racks are just logical. I would recommend migrating all your 
> DCs to GPFS though for consistency.
>
> > On Fri., 29 Jun. 2018, 09:04 Randy Lynn,  wrote:
> > > So we have two data centers already running..
> > >
> > > AP-SYDNEY, and US-EAST.. I'm using Ec2Snitch over a site-to-site tunnel.. 
> > > I'm wanting to move the current US-EAST from AZ 1a to 1e..
> > > I know all docs say use ec2multiregion for multi-DC.
> > >
> > > I like the GPFS idea. would that work with the multi-DC too?
> > > What's the downside? status would report rack of 1a, even though in 1e?
> > >
> > > Thanks in advance for the help/thoughts!!
> > >
> > >
> > > > On Thu, Jun 28, 2018 at 6:20 PM, kurt greaves  
> > > > wrote:
> > > > > There is a need for a repair with both DCs as rebuild will not stream 
> > > > > all replicas, so unless you can guarantee you were perfectly 
> > > > > consistent at time of rebuild you'll want to do a repair after 
> > > > > rebuild.
> > > > >
> > > > > On another note you could just replace the nodes but use GPFS instead 
> > > > > of EC2 snitch, using the same rack name.
> > > > >
> > > > > > On Fri., 29 Jun. 2018, 00:19 Rahul Singh, 
> > > > > >  wrote:
> > > > > > > Parallel load is the best approach and then switch your Data 
> > > > > > > access code to only access the new hardware. After you verify 
> > > > > > > that there are no local read / writes on the OLD dc and that the 
> > > > > > > updates are only via Gossip, then go ahead and change the 
> > > > > > > replication factor on the key space to have zero replicas in the 
> > > > > > > old DC. Then you can decommission it.
> > > > > > >
> > > > > > > This way you are hundred percent sure that you aren’t missing any 
> > > > > > > new data. No need for a DC to DC repair but a repair is always 
> > > > > > > healthy.
> > > > > > >
> > > > > > > Rahul
> > > > > > > On Jun 28, 2018, 9:15 AM -0500, Randy Lynn , 
> > > > > > > wrote:
> > > > > > > > Already running with Ec2.
> > > > > > > >
> > > > > > > > My original thought was a new DC parallel to the current, and 
> > > > > > > > then decommission the other DC.
> > > > > > > >
> > > > > > > > Also my data load is small right now.. I know small is relative 
> > > > > > > > term.. each node is carrying about 6GB..
> > > > > > > >
> > > > > > > > So given the data size, would you go with parallel DC or let 
> > > > > > > > the new AZ carry a heavy load until the others are migrated 
> > > > > > > > over?
> > > > > > > > and then I think "repair" to cleanup the replications?
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Thu, Jun 28, 2018 at 10:09 AM, Rahul Singh 
> > > > > > > > >  wrote:
> > > > > > > > > > You don’t have to use EC2 snitch on AWS but if you have 
> > > > > > > > > > already started with it , it may put a node in a different 
> > > > > > > > > > DC.
> > > > > > > > > >
> > > > > > > > > > If your data density won’t be ridiculous You could add 3 to 
> > > > > > > > > > different DC/ Region and then sync up. After the new DC is 
> > > > > > > > > > operational you can remove one at a time on the old DC and 
> > > > > > > > > > at the same time add to the new one.
> > > > > > > > > >
> > > > > > > > > > Rahul
> > > > > > > > > > On Jun 28, 2018, 9:03 AM -0500, Randy Lynn 
> > > > > > > > > > , w

Re: Check Cluster Health

2018-06-28 Thread Rahul Singh


When you run the tpstats or tablestats subcommands in nodetool you are actually
accessing data inside Cassandra via JMX.

You can start there first.

Rahul
On Jun 28, 2018, 10:55 AM -0500, Thouraya TH , wrote:
> Hi,
>
> Please, how can I check the health of my cluster / data center using Cassandra?
> In fact I'd like to generate a history of the state of each node - a history
> of the failures of my cluster (20% failure in a day, 40% failure in
> a day, etc...)
>
> Thank you so much.
> Kind regards.


Re: C* in multiple AWS AZ's

2018-06-28 Thread Rahul Singh
Parallel load is the best approach; then switch your data access code to only
access the new hardware. After you verify that there are no local reads /
writes on the OLD DC and that the updates are only via Gossip, then go ahead
and change the replication factor on the keyspace to have zero replicas in the
old DC. Then you can decommission it.
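
A hedged sketch of that replication change (keyspace and DC names are
illustrative):

```
-- omitting the old DC from the replication map leaves zero replicas there
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_new': 3};
```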

This way you are a hundred percent sure that you aren’t missing any new data. No
need for a DC-to-DC repair, but a repair is always healthy.

Rahul
On Jun 28, 2018, 9:15 AM -0500, Randy Lynn , wrote:
> Already running with Ec2.
>
> My original thought was a new DC parallel to the current, and then 
> decommission the other DC.
>
> Also my data load is small right now.. I know small is relative term.. each 
> node is carrying about 6GB..
>
> So given the data size, would you go with parallel DC or let the new AZ carry 
> a heavy load until the others are migrated over?
> and then I think "repair" to cleanup the replications?
>
>
> > On Thu, Jun 28, 2018 at 10:09 AM, Rahul Singh 
> >  wrote:
> > > You don’t have to use EC2 snitch on AWS but if you have already started 
> > > with it , it may put a node in a different DC.
> > >
> > > If your data density won’t be ridiculous You could add 3 to different DC/ 
> > > Region and then sync up. After the new DC is operational you can remove 
> > > one at a time on the old DC and at the same time add to the new one.
> > >
> > > Rahul
> > > On Jun 28, 2018, 9:03 AM -0500, Randy Lynn , wrote:
> > > > I have a 6-node cluster I'm migrating to the new i3 types.
> > > > But at the same time I want to migrate to a different AZ.
> > > >
> > > > What happens if I do the "running node replace method" with 1 node at a 
> > > > time moving to the new AZ. Meaning, I'll have temporarily;
> > > >
> > > > 5 nodes in AZ 1c
> > > > 1 new node in AZ 1e.
> > > >
> > > > I'll wash-rinse-repeat till all 6 are on the new machine type and in 
> > > > the new AZ.
> > > >
> > > > Any thoughts about whether this gets weird with the Ec2Snitch and a RF 
> > > > 3?
> > > >
> > > > --
> > > > Randy Lynn
> > > > rl...@getavail.com
> > > >
> > > > office:
> > > > 859.963.1616 ext 202
> > > > 163 East Main Street - Lexington, KY 40507 - USA
> > > >
> > > > getavail.com
>
>
>
> --
> Randy Lynn
> rl...@getavail.com
>
> office:
> 859.963.1616 ext 202
> 163 East Main Street - Lexington, KY 40507 - USA
>
> getavail.com


Re: C* in multiple AWS AZ's

2018-06-28 Thread Rahul Singh
You don’t have to use the EC2 snitch on AWS, but if you have already started
with it, it may put a node in a different DC.

If your data density won’t be ridiculous, you could add 3 nodes to a different
DC / region and then sync up. After the new DC is operational you can remove
one node at a time from the old DC and at the same time add to the new one.

Rahul
On Jun 28, 2018, 9:03 AM -0500, Randy Lynn , wrote:
> I have a 6-node cluster I'm migrating to the new i3 types.
> But at the same time I want to migrate to a different AZ.
>
> What happens if I do the "running node replace method" with 1 node at a time 
> moving to the new AZ. Meaning, I'll have temporarily;
>
> 5 nodes in AZ 1c
> 1 new node in AZ 1e.
>
> I'll wash-rinse-repeat till all 6 are on the new machine type and in the new 
> AZ.
>
> Any thoughts about whether this gets weird with the Ec2Snitch and a RF 3?
>
> --
> Randy Lynn
> rl...@getavail.com
>
> office:
> 859.963.1616 ext 202
> 163 East Main Street - Lexington, KY 40507 - USA
>
> getavail.com


Re: How do you monitoring Cassandra Cluster?

2018-06-21 Thread Rahul Singh
I’ve collected a bunch at http://leaves.anant.us/#!/?tag=cassandra,monitoring

I recommend Grafana / Prometheus if you don’t have DSE (which has OpsCenter)


--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jun 19, 2018, 1:06 PM -0400, Romain Gérard , wrote:
> Hi Felipe,
>
> You can use this project https://github.com/criteo/cassandra_exporter if you
> are using Prometheus (disclaimer: I am one of the authors of it).
> A Grafana dashboard is included that aggregates metrics per cluster for
> you, and in the "edit view" of each chart there are hidden queries that let
> you drill down by node.
> In any case, you can look at the json in order to grasp how to aggregate
> queries.
>
> Let me know if you need more information regarding the setup.
>
> If you end up using the project, give it a star - it is always
> appreciated.
> Regards,
> Romain Gérard
>
>
> On Jun 18 2018, at 3:25 pm, Felipe Esteves  
> wrote:
> >
> > Hi, everyone,
> >
> > I'm running some tests to monitor Cassandra 3.x with jmx_exporter +
> > prometheus + grafana.
> > I've managed to configure it all and use the dashboard
> > https://grafana.com/dashboards/5408
> >
> > However, I still can't aggregate metrics across my whole cluster, just per
> > node individually.
> > Any tips on how to do that?
> >
> > Also, OpsCenter gives some datacenter aggregation; I think it comes from
> > nodetool as I didn't see any metrics about that.
> > Anyone having success with that?
> >
> > cheers!
> >
> > Em qua, 28 de jun de 2017 às 19:43, Petrus Gomes  
> > escreveu:
> > > I'm using JMX+Prometheus and Grafana.
> > > JMX = https://github.com/prometheus/jmx_exporter
> > > Prometheus + Grafana = https://prometheus.io/docs/visualization/grafana/
> > >
> > > There are some dashboard examples like that: 
> > > https://grafana.com/dashboards/371
> > > Looks good.
> > >
> > > Thanks,
> > > Petrus Silva
> > >
> > > On Wed, Jun 28, 2017 at 5:55 AM, Peng Xiao <2535...@qq.com> wrote:
> > > > Dear All,
> > > >
> > > > we are currently using Cassandra 2.1.13,and it has grown to 5TB size 
> > > > with 32 nodes in one DC.
> > > > For monitoring,opsCenter does not  send alarm and not free in higher 
> > > > version.so we have to use a simple JMX+Zabbix template.And we plan to 
> > > > use Jolokia+JMX2Graphite to draw the metrics chart now.
> > > >
> > > > Could you please advise?
> > > >
> > > > Thanks,
> > > > Henry
> > >
> > >
> > >
> > >
> > > This message may include confidential information and only the intended 
> > > addresses have the right to use it as is, or any part of it. A wrong 
> > > transmission does not break its confidentiality. If you've received it 
> > > because of a mistake or erroneous transmission, please notify the sender 
> > > and delete it from your system immediately. This communication 
> > > environment is controlled and monitored.
> > >
> > > B2W Digital
> > >
> > >
> > --
> > Felipe Esteves
> >
> > Tecnologia
> >
> > felipe.este...@b2wdigital.com
> >
> > Tel.: (21) 3504-7162 ramal 57162


RE: [EXTERNAL] Re: Tombstone

2018-06-21 Thread Rahul Singh
Queues can be implemented in Cassandra, even though everyone believes it's an
“anti-pattern”, if the design is tailored to Cassandra’s model.

In this case, I would do a logical / soft delete on the data to invalidate it
from any query that accesses it, and put a TTL on the data so it deletes
automatically later. You could have a default TTL, or set a TTL on your
actual “delete”, which would put the delete in the future, for example 3 days
from now.
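
For concreteness, a minimal sketch of the soft-delete-plus-TTL idea (table,
values, and TTLs are illustrative):

```
CREATE TABLE queue_items (
    bucket int,
    item_id timeuuid,
    payload text,
    status text,
    PRIMARY KEY (bucket, item_id)
) WITH default_time_to_live = 259200;  -- every row expires on its own after 3 days

-- the "delete" is just a status flip; readers filter these rows out client-side
UPDATE queue_items SET status = 'consumed'
WHERE bucket = 42 AND item_id = 50554d6e-29bb-11e5-b345-feff819cdc9f;
```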

Some sources of inspiration on how people have been doing queues on Cassandra

cherami by Uber
CMB by Comcast
cassieq — I don’t remember who wrote it.



--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jun 19, 2018, 12:39 PM -0400, Durity, Sean R , 
wrote:
> This sounds like a queue pattern, which is typically an anti-pattern for 
> Cassandra. I would say that it is very difficult to get the access patterns, 
> tombstones, and everything else lined up properly to solve a queue problem.
>
>
> Sean Durity
>
> From: Abhishek Singh 
> Sent: Tuesday, June 19, 2018 10:41 AM
> To: user@cassandra.apache.org
> Subject: [EXTERNAL] Re: Tombstone
>
The partition key is made of a datetime (basically the date truncated to the
hour) and a bucket. I think your RCA may be correct, since we are deleting the
partition rows one by one, not in a batch, so files may be overlapping for the
particular partition. A scheduled thread picks the rows for a partition based
on the current datetime and bucket number and checks whether each row's entry
is past due or not; if yes, we trigger an event and remove the entry.
>
>
>
> On Tue 19 Jun, 2018, 7:58 PM Jeff Jirsa,  wrote:
> > The most likely explanation is tombstones in files that won’t be collected 
> > as they potentially overlap data in other files with a lower timestamp 
> > (especially true if your partition key doesn’t change and you’re writing 
> > and deleting data within a partition)
> >
> > --
> > Jeff Jirsa
> >
> >
> > > On Jun 19, 2018, at 3:28 AM, Abhishek Singh  wrote:
> > >
> > > Hi all,
> > >            We are using Cassandra for storing time-series-based events
> > > for batch processing. Once a particular batch based on the hour is
> > > processed, we delete the entries, but we were left with almost 18% of
> > > deletes marked as tombstones.
> > >                  I ran compaction on the particular CF; the tombstone
> > > count didn't come down.
> > >             Can anyone suggest the optimal tuning/recommended
> > > practice for compaction strategy and GC grace period with 100k
> > > entries and deletes every hour?
> > >
> > > Warm Regards
> > > Abhishek Singh
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org


RE: how to avoid lightwieght transactions

2018-06-21 Thread Rahul Singh
A read before write is always going to be tremendously more expensive than just 
writing. Depending on your architecture you may consider both of the options 
described.

If you have a CQRS architecture and are processing an event queue — doing LWT / 
read before write, then your “write” is processed asynchronously by your 
command processor.

If you are directly doing interactions with Cassandra, and need extremely fast 
writes with no latency, I’d do append only method.

CQRS just separates the event processing from the reading — and when combined 
with an asynchronous architecture in your application such as an event queue — 
basically mitigates / hedges the performance loss of doing LWT.

You can always use CQRS without LWT.
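
For concreteness, a hedged sketch against the multirow table quoted below (the 
statements are mine, not from the thread):

```
-- LWT variant: a Paxos round trip (read before write) guards the update
UPDATE multirow SET status = 'completed'
WHERE id = 'tx1' AND time = '2018-06-20T10:00:00Z'
IF EXISTS;

-- Append-only variant: no read before write; every status change is a
-- new clustered row, and the reader resolves the latest state
INSERT INTO multirow (id, time, transcation_type, status)
VALUES ('tx1', '2018-06-20T10:05:00Z', 'status_change', 'completed');
```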

Rahul
On Jun 21, 2018, 4:38 AM -0400, Jacques-Henri Berthemet 
, wrote:
> Hi,
>
> Another way would be to use Id as the partition key and time as a clustering 
> column of type TimeUUID. Then you'll always insert records, never update; for 
> each “transaction” you'll keep a row in the partition. Then when you read all 
> the rows for that partition by Id, you process all of them to know the real 
> status. For example, if the final status must be “completed” and you have:
>
> Id, TimeUUID, status
> 1, t0, added
> 1, t1, added
> 1, t2, completed
> 1, t3, added
>
> When reading back you’ll just discard the last row.
>
>
> If you’re only concerned about the “insert or update” case but the data is 
> actually the same, you can always insert. If you insert on an existing record 
> it will just overwrite it, if you update without an existing record it will 
> insert data. In Cassandra there is not much difference between insert and 
> update operations.
>
> Regards,
> --
> Jacques-Henri Berthemet
>
> From: Rajesh Kishore [mailto:rajesh10si...@gmail.com]
> Sent: Thursday, June 21, 2018 7:45 AM
> To: user@cassandra.apache.org
> Subject: Re: how to avoid lightwieght transactions
>
> Hi,
>
> I think the LWT feature was introduced for exactly your kind of use case - you 
> don't want other requests updating the same data at the same time, which LWT 
> prevents using the Paxos algo (2-phase commit).
> So, IMO your use case makes perfect sense for LWT, to avoid concurrent 
> updates.
> If your issue is not the concurrent-update one, then IMHO you may want to 
> split this into two steps:
> - get the transcation_type with quorum (or a higher consistency level)
> - and conditionally update the row with quorum (or a higher consistency level)
> But remember, this won't be atomic in nature and won't solve the concurrent 
> update issue if you have one.
>
> Regards,
> Rajesh
>
>
>
> On Wed, Jun 20, 2018 at 2:59 AM, manuj singh  wrote:
> > quote_type
> > Hi all,
> > we have a use case where we need to update our rows frequently. Now in 
> > order to do so, and so that we don't override updates, we have to resort to 
> > lightweight transactions.
> > Since lightweight transactions are expensive (could be 4 times as expensive 
> > as a normal insert), how do we model around them?
> >
> > e.g i have a table where
> >
> > CREATE TABLE multirow (
> >     id text,
> >     time text,
> >     transcation_type text,
> >     status text,
> >     PRIMARY KEY (id, time)
> > )
> >
> > So let's say we update the status column multiple times. The first time we 
> > update, we also have to make sure that the transaction exists; otherwise a 
> > normal update will insert it, and then the original insert comes in and 
> > overrides the update.
> > So in order to fix that we need to use lightweight transactions.
> >
> > Is there another way i can model this so that we can avoid the lightweight 
> > transactions.
> >
> >
> > Thanks
> >
>


Re: Options to replace hardware of the cluster

2018-06-14 Thread Rahul Singh
How much data do you have and what is the timeline? If you can manage with a 
maintenance window, the snapshot / move and restore method may be the fastest. 
Streaming can take a long time to sync two DCs if there is a lot of data.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On Jun 14, 2018, 4:11 AM -0400, Christian Lorenz 
, wrote:
> Hi,
>
> we need to move our existing cassandra cluster to new hardware nodes. 
> Currently the cluster size is 8 members, they need to be moved to 8 new 
> machines. Cassandra version in use is 3.11.1.  Unfortunately we use 
> materialized views in production. I know that they have been marked 
> retroactively as experimental.
> What is a good way to move to the new machines? One-by-One, or setup a new 
> cluster as a separate DC? The move should be done without downtime of the 
> application.
>
> Do you have some advice for this kind of maintenance task?
>
> Kind regards,
> Christian


Re: Options to replace hardware of the cluster

2018-06-14 Thread Rahul Singh
For no downtime and no lost data, I would make a new DC in the same cluster, 
and wait for the data / MVs to stream over. Otherwise, the best way is to 
snapshot everything and bring up the nodes all at once.
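
Roughly, the new-DC route looks like this (keyspace and DC names are 
placeholders):

```
-- extend replication to the new datacenter
ALTER KEYSPACE my_keyspace
WITH replication = {'class': 'NetworkTopologyStrategy',
                    'DC_old': 3, 'DC_new': 3};
-- then, on each node in DC_new: nodetool rebuild -- DC_old
-- once clients have switched over, drop DC_old from the map and
-- decommission the old nodes
```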
On Jun 14, 2018, 4:11 AM -0400, Christian Lorenz 
, wrote:
> Hi,
>
> we need to move our existing cassandra cluster to new hardware nodes. 
> Currently the cluster size is 8 members, they need to be moved to 8 new 
> machines. Cassandra version in use is 3.11.1.  Unfortunately we use 
> materialized views in production. I know that they have been marked 
> retroactively as experimental.
> What is a good way to move to the new machines? One-by-One, or setup a new 
> cluster as a separate DC? The move should be done without downtime of the 
> application.
>
> Do you have some advice for this kind of maintenance task?
>
> Kind regards,
> Christian


Re: nodetool repair -pr

2018-06-08 Thread Rahul Singh
From DS dox: "Do not use -pr with this option to repair only a local data 
center."
On Jun 8, 2018, 10:42 AM -0400, user@cassandra.apache.org, wrote:
>
> nodetool repair -pr


Re: Certified Cassandra for Enterprise use

2018-05-31 Thread Rahul Singh
To be as objective as possible:

Product vendors
Datastax
Stratio

Infrastructure/ Database as a Service
Instaclustr
CosmosDB on Azure.

Container Orchestration
Mesosphere (DCOS creator) has limited support of “certified” Cassandra and DSE 
containers on Mesos


Disclosure : our firm is a DataStax services partner.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation
On May 29, 2018, 4:01 AM -0400, Ben Slater , wrote:
> Hi Pranay
>
> We (Instaclustr) provide enterprise support for Cassandra 
> (https://www.instaclustr.com/services/cassandra-support/) which may cover 
> what you are looking for.
>
> Please get in touch direct if you would like to discuss.
>
> Cheers
> Ben
>
> > On Tue, 29 May 2018 at 10:11 Pranay akula  
> > wrote:
> > > Is there any third party who provides security patches/releases for 
> > > Apache cassandra
> > >
> > > For Enterprise use is there any third party who provides certified Apache 
> > > cassandra packages ??
> > >
> > > Thanks
> > > Pranay
> --
> Ben Slater
> Chief Product Officer
>
>
> Read our latest technical blog posts here.
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia) 
> and Instaclustr Inc (USA).
> This email and any attachments may contain confidential and legally 
> privileged information.  If you are not the intended recipient, do not copy 
> or disclose its content, but please reply to this email immediately and 
> highlight the error to the sender and then immediately delete the message.


Re: Fwd: Re: cassandra update vs insert + delete

2018-05-30 Thread Rahul Singh
Soft delete = logical delete - which is an update.

An update doesn't create a tombstone. It appends to the SSTable, and when 
SSTables are compacted, the latest write is what is seen as the definitive data.

A tombstone, by definition, is an update which tells C* to remove the value that 
was there before, but doesn't do it immediately.
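
A small illustration of the upsert behaviour, assuming a trivial table t 
(pk int PRIMARY KEY, val text):

```
INSERT INTO t (pk, val) VALUES (1, 'a');
UPDATE t SET val = 'b' WHERE pk = 1;             -- just a newer timestamped cell
SELECT val, WRITETIME(val) FROM t WHERE pk = 1;  -- 'b', with the later timestamp
```

One caveat for the question as asked: CQL won't let an UPDATE change a 
partition key column at all, so changing a column of the partition key 
necessarily means inserting a new row and deleting the old one.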


On May 28, 2018, 2:32 AM -0400, onmstester onmstester , 
wrote:
> How does update work underneath?
> Does it create a new row (because I'm changing a column of the partition key) 
> and add a tombstone to the old row?
>
> Sent using Zoho Mail
>
>
>  Forwarded message 
> From : Jonathan Haddad 
> To : 
> Date : Mon, 28 May 2018 00:07:36 +0430
> Subject : Re: cassandra update vs insert + delete
>  Forwarded message 
>
> > What is a “soft delete”?
> >
> > My 2 cents, if you want to update some information just update it. There’s 
> > no need to overthink it.
> >
> > Batches are good if they’re constrained to a single partition, not so hot 
> > otherwise.
> >
> >
> > On Sun, May 27, 2018 at 8:19 AM Rahul Singh  
> > wrote:
> >
> > --
> > Jon Haddad
> > http://www.rustyrazorblade.com
> > twitter: rustyrazorblade
> > > Deletes create tombstones — not really something to consider. Better to 
> > > add / update or insert data and do a soft delete on old data and apply a 
> > > TTL to remove it at a future time.
> > >
> > > --
> > > Rahul Singh
> > > rahul.si...@anant.us
> > >
> > > Anant Corporation
> > >
> > > On May 27, 2018, 5:36 AM -0400, onmstester onmstester 
> > > , wrote:
> > >
> > > > Hi
> > > > I want to load all rows from many partitions and change a column value 
> > > > in each row, which of the following ways is better concerning disk space 
> > > > and performance?
> > > > 1. create a update statement for every row and batch update for each 
> > > > partitions
> > > > 2. create an insert statement for every row and batch insert for each 
> > > > partition, then run a single statement to delete the whole old partition
> > > >
> > > > Thanks in advance
> > > >
> > > > Sent using Zoho Mail
> > > >
>
>


Re: cassandra update vs insert + delete

2018-05-27 Thread Rahul Singh
Deletes create tombstones — not really something to consider. Better to add / 
update or insert data and do a soft delete on old data and apply a TTL to 
remove it at a future time.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 27, 2018, 5:36 AM -0400, onmstester onmstester <onmstes...@zoho.com>, 
wrote:
> Hi
> I want to load all rows from many partitions and change a column value in 
> each row, which of the following ways is better concerning disk space and 
> performance?
> 1. create a update statement for every row and batch update for each 
> partitions
> 2. create an insert statement for every row and batch insert for each 
> partition, then run a single statement to delete the whole old partition
>
> Thanks in advance
>
> Sent using Zoho Mail
>
>


Re: EXT: Cassandra Monitoring tool

2018-05-25 Thread Rahul Singh
Good article about it on LI

https://www.linkedin.com/pulse/snap-cassandra-s3-tablesnap-vijaya-kumar-hosamani/

On May 25, 2018, 2:52 PM -0500, Joaquin Casares , 
wrote:
> Hello Aneesh,
>
> While this doesn't provide a GUI, tablesnap is a community tool that does a 
> great job at handling backups:
>
> > https://github.com/JeremyGrosser/tablesnap
>
> Cheers,
>
> Joaquin
>
> Joaquin Casares
> Consultant
> Austin, TX
>
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> > On Fri, May 25, 2018 at 11:36 AM, ANEESH KUMAR K.M  
> > wrote:
> > > Thank you Hari for the details.
> > >
> > > One more question: please suggest a cluster management tool for a 
> > > Cassandra cluster. Looking for open-source tools that support taking 
> > > snapshots and restoring via a GUI.
> > >
> > > Regards,
> > > Aneesh
> > >
> > > > On Fri, May 25, 2018 at 10:00 PM, Harikrishnan Pillai 
> > > >  wrote:
> > > > > I assume you are using open source Cassandra; you can look at 
> > > > > Prometheus + Grafana for Cassandra monitoring, and there is a lot of 
> > > > > information available on the internet about how to set up Prometheus 
> > > > > monitoring for Cassandra.
> > > > >
> > > > >
> > > > > Sent from my iPhone
> > > > >
> > > > > > On May 25, 2018, at 9:23 AM, ANEESH KUMAR K.M  
> > > > > > wrote:
> > > > > >
> > > > > > Please suggest a good cluster monitoring tool for a Cassandra 
> > > > > > multi-region cluster.
> > > > > >
> > > > >
> > > > > -
> > > > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > > > >
> > >
>


Re: estimated number of keys vs ttl

2018-05-23 Thread Rahul Singh
If the TTL actually reduces the key count, it should. It's possible to TTL a 
row out of a partition without removing the whole partition: 1 key = 1 
partition != 1 row != 1 cell.
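
For example, with a hypothetical table t (pk int, ck int, val text, 
PRIMARY KEY (pk, ck)):

```
INSERT INTO t (pk, ck, val) VALUES (1, 1, 'short-lived') USING TTL 86400;
INSERT INTO t (pk, ck, val) VALUES (1, 2, 'stays');
-- a day later row (1, 1) has expired, but partition 1 still exists and
-- still counts toward the estimated partition (key) count
```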

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 23, 2018, 6:07 AM -0500, Grzegorz Pietrusza <gpietru...@gmail.com>, 
wrote:
> Hi
>
> I'm using tablestats to get the estimated number of partition keys. In my 
> case all writes are done with a TTL of a few days. Is the key count decreased 
> when the TTL hits?
>
> Regards
> Grzegorz


Re: How to measure time to execute joinToCassandraTable

2018-05-13 Thread Rahul Singh
Anytime you want to measure distributed processes, you should look into 
logging, or sending the timing data asynchronously to a persistent data store. 
I haven't measured joinWithCassandraTable, but I've measured other parts of 
Spark using a Kafka topic to send execution times. I then consumed the times 
and saved them into Cassandra, so I could later get time aggregates and average 
times per operation.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On May 13, 2018, 4:14 PM -0500, Guillermo Ortiz <konstt2...@gmail.com>, wrote:
> I'm using the Spark-Cassandra driver, and I would like to know if there is an 
> easy way to measure the time a "joinWithCassandraTable" takes to execute 
> inside Spark code.


Re: Determining active sstables and table- dir

2018-05-01 Thread Rahul Singh
The schema column families table is the most authoritative source. You may have 
multiple data directories.
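
For example, against the 2.2-era schema table mentioned in the question:

```
SELECT cf_id FROM system.schema_columnfamilies
WHERE keyspace_name = 'myks' AND columnfamily_name = 'mytable';
-- the live directory should be the one whose suffix matches this cf_id
-- with the dashes stripped
```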

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 27, 2018, 1:24 PM -0700, Carl Mueller <carl.muel...@smartthings.com>, 
wrote:
> IN cases where a table was dropped and re-added, there are now two table 
> directories with different uuids with sstables.
>
> If you don't have knowledge of which one is active, how do you determine 
> which is the active table directory? I have tried cf_id from 
> system.schema_columnfamilies and that can work some of the time but have seen 
> times cf_id != table-
>
> I have also seen situations where sstables that don't have the 
> table/columnfamily are in the table dir and are clearly the active sstables 
> (they compacted when I did a nodetool compact)
>
> Is there a way to get a running cassandra node's sstables for a given 
> keyspace/table and what table- is active?
>
> This is in a 2.2.x environment that has probably churned a bit from 2.1.x


Re: GUI clients for Cassandra

2018-04-23 Thread Rahul Singh
Zeppelin and Dbeaver EE are both good.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 23, 2018, 12:53 AM -0400, Eunsu Kim <eunsu.bil...@gmail.com>, wrote:
> I am now using DBeaver EE, but I'm waiting for TeamSQL (https://teamsql.io) 
> to support cassandra.
>
> > On 23 Apr 2018, at 7:56 AM, Tim Moore <tim.mo...@lightbend.com> wrote:
> >
> > I use the command-line too, but have heard some recommendations for DBeaver 
> > EE as a cross-database GUI with support for Cassandra: https://dbeaver.com/
> >
> > > On Sun, Apr 22, 2018 at 3:58 PM, Hannu Kröger <hkro...@gmail.com> wrote:
> > > > Hello everyone!
> > > >
> > > > I have been asked many times that what is a good GUI client for 
> > > > Cassandra. DevCenter is not available anymore and DataStax has a 
> > > > DevStudio but that’s for DSE only.
> > > >
> > > > Are there some 3rd party GUI tools that you are using a lot? I always 
> > > > use the command line client myself. I have tried to look for some 
> > > > Cassandra related tools but I haven’t found any good one yet.
> > > >
> > > > Cheers,
> > > > Hannu
> > > > -
> > > > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > > > For additional commands, e-mail: user-h...@cassandra.apache.org
> > > >
> >
> >
> >
> > --
> > Tim Moore
> > Lagom Tech Lead, Lightbend, Inc.
> > tim.mo...@lightbend.com
> > +61 420 981 589
> > Skype: timothy.m.moore
> >
>


Re: read repair with consistency one

2018-04-21 Thread Rahul Singh
Read repairs are one anti-entropy measure. Continuous repair is another. If you 
do repairs via Reaper or your own method, that will resolve your discrepancies.
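
For reference, the knobs from the question are per-table settings in these 
pre-4.0 versions, e.g.:

```
ALTER TABLE myks.mytable
WITH read_repair_chance = 0.1            -- global: digest reads may span DCs
AND dclocal_read_repair_chance = 0.0;    -- confined to the local DC
```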

On Apr 21, 2018, 3:16 AM -0400, Grzegorz Pietrusza , 
wrote:
> Hi all
>
> I'm a bit confused with how read repair works in my case, which is:
> - multiple DCs with RF 1 (NetworkTopologyStrategy)
> - reads with consistency ONE
>
>
> The article #1 says that read repair in fact runs RF reads for some percent 
> of the requests. Let's say I have read_repair_chance = 0.1. Does it mean that 
> 10% of requests will be read in all DCs (digest) and processed in a 
> background?
>
> On the other hand article #2 says that for consistency ONE read repair is not 
> performed. Does it mean that in my case read repair does not work at all? Is 
> there any way to enable read repair across DCs and stay with consistency ONE 
> for reads?
>
>
> #1 https://www.datastax.com/dev/blog/common-mistakes-and-misconceptions
> #2 
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesReadRepair.html
>
> Regards
> Grzegorz


Re: copy from one table to another

2018-04-21 Thread Rahul Singh
That’s correct.

On Apr 21, 2018, 5:05 AM -0400, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:
> You mean that the correct table UUID should be specified as the suffix in the 
> directory name?
> For example:
>
> Table:
>
> cqlsh> select id from system_schema.tables where keyspace_name='test' and 
> table_name='usr';
>
>  id
> --
>  ea2f6da0-f931-11e7-8224-43ca70555242
>
>
> Directory name:
> ./data/test/usr-ea2f6da0f93111e7822443ca70555242
>
> Correct?
>
> Regards,
> Kyrill
> From: Rahul Singh <rahul.xavier.si...@gmail.com>
> Sent: Thursday, April 19, 2018 10:53:11 PM
> To: user@cassandra.apache.org
> Subject: Re: copy from one table to another
>
> Each table has a different Guid — doing a hard link may work as long as the 
> sstable dir's guid is the same as the newly created table's in the system 
> schema.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 19, 2018, 10:41 AM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
> wrote:
> > The table is too large to be copied fast/effectively, so I'd like to 
> > leverage the immutability property of SSTables.
> >
> > My idea is to:
> > 1) create new empty table (NewTable) with the same structure as existing 
> > one (OldTable)
> > 2) at some time run simultaneous 'nodetool snapshot -t ttt  
> > OldTable' on all nodes -- this will create point in time state of OldTable
> > 3) on each node run:
> >        for each file in OldTable ttt snapshot directory:
> >  ln 
> > //OldTable-/snapshots/ttt/_OldTable_xx 
> > .//Newtable/_NewTable_x
> >  then:
> >  nodetool refresh  NewTable
> > 4) nodetool repair NewTable
> > 5) Use OldTable and NewTable independently (Read/Write)
> >
> > Are there any issues with using hardlinks (ln) instead of copying (cp) in 
> > this case?
> >
> > Thanks,
> > Kyrill
> >
> > From: Rahul Singh <rahul.xavier.si...@gmail.com>
> > Sent: Wednesday, April 18, 2018 2:07:17 AM
> > To: user@cassandra.apache.org
> > Subject: Re: copy from one table to another
> >
> > 1. Make a new table with the same schema.
> > For each node
> > 2. Shutdown node
> > 3. Copy data from Source sstable dir to new sstable dir.
> >
> > This will do what you want.
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Apr 16, 2018, 4:21 PM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
> > wrote:
> > > Thanks,  Ali.
> > > I just need to copy a large table in production without actual copying by 
> > > using hardlinks. After this both tables should be used independently 
> > > (RW). Is this a supported way or not?
> > >
> > > Regards,
> > > Kyrill
> > > From: Ali Hubail <ali.hub...@petrolink.com>
> > > Sent: Monday, April 16, 2018 6:51:51 PM
> > > To: user@cassandra.apache.org
> > > Subject: Re: copy from one table to another
> > >
> > > If you want to copy a portion of the data to another table, you can also 
> > > use sstable cql writer. It is more of an advanced feature and can be 
> > > tricky, but doable.
> > > once you write the new sstables, you can then use the sstableloader to 
> > > stream the new data into the new table.
> > > check this out:
> > > https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
> > >
> > > I have recently used this to clean up 500 GB worth of sstable data in 
> > > order to purge tombstones that were mistakenly generated by the client.
> > > obviously this is not as fast as hardlinks + refresh, but it's much 
> > > faster and more efficient than using cql to copy data across the tables.
> > > take advantage of CQLSSTableWriter.builder.sorted() if you can, and 
> > > utilize writetime if you have to.
> > >
> > > Ali Hubail
> > >

Re: copy from one table to another

2018-04-19 Thread Rahul Singh
Each table has a different Guid — doing a hard link may work as long as the 
sstable dir's guid is the same as the newly created table's in the system 
schema.

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 19, 2018, 10:41 AM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
wrote:
> The table is too large to be copied fast/effectively, so I'd like to 
> leverage the immutability property of SSTables.
>
> My idea is to:
> 1) create new empty table (NewTable) with the same structure as existing one 
> (OldTable)
> 2) at some time run simultaneous 'nodetool snapshot -t ttt  
> OldTable' on all nodes -- this will create point in time state of OldTable
> 3) on each node run:
>        for each file in OldTable ttt snapshot directory:
>  ln 
> //OldTable-/snapshots/ttt/_OldTable_xx 
> .//Newtable/_NewTable_x
>  then:
>  nodetool refresh  NewTable
> 4) nodetool repair NewTable
> 5) Use OldTable and NewTable independently (Read/Write)
>
> Are there any issues with using hardlinks (ln) instead of copying (cp) in 
> this case?
>
> Thanks,
> Kyrill
>
> From: Rahul Singh <rahul.xavier.si...@gmail.com>
> Sent: Wednesday, April 18, 2018 2:07:17 AM
> To: user@cassandra.apache.org
> Subject: Re: copy from one table to another
>
> 1. Make a new table with the same schema.
> For each node
> 2. Shutdown node
> 3. Copy data from Source sstable dir to new sstable dir.
>
> This will do what you want.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 16, 2018, 4:21 PM -0500, Kyrylo Lebediev <kyrylo_lebed...@epam.com>, 
> wrote:
> > Thanks,  Ali.
> > I just need to copy a large table in production without actual copying by 
> > using hardlinks. After this both tables should be used independently (RW). 
> > Is this a supported way or not?
> >
> > Regards,
> > Kyrill
> > From: Ali Hubail <ali.hub...@petrolink.com>
> > Sent: Monday, April 16, 2018 6:51:51 PM
> > To: user@cassandra.apache.org
> > Subject: Re: copy from one table to another
> >
> > If you want to copy a portion of the data to another table, you can also 
> > use sstable cql writer. It is more of an advanced feature and can be 
> > tricky, but doable.
> > once you write the new sstables, you can then use the sstableloader to 
> > stream the new data into the new table.
> > check this out:
> > https://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
> >
> > I have recently used this to clean up 500 GB worth of sstable data in order 
> > to purge tombstones that were mistakenly generated by the client.
> > obviously this is not as fast as hardlinks + refresh, but it's much faster 
> > and more efficient than using cql to copy data across the tables.
> > take advantage of CQLSSTableWriter.builder.sorted() if you can, and utilize 
> > writetime if you have to.
> >
> > Ali Hubail
> >
> >
> >
> > Kyrylo Lebediev <kyrylo_lebed...@epam.com>
> > 04/16/2018 10:37 AM
> > Please respond to
> > user@cassandra.apache.org
> >
> > To
> > "user@cassandra.apache.org" <user@cassandra.apache.org>,
> > cc
> > Subject
> > Re: copy from one table to another
> >
> >
> >
> >
> >
> > Any issues if we:
> >
> > 1) create an new empty table with the same structure as the old one
> > 1) create a new empty table with the same structure as the old one
> > .../-/--* ---> 
> > .

Re: Phantom growth resulting automatically node shutdown

2018-04-19 Thread Rahul Singh
I've seen something similar in 2.1. Our issue was related to file permissions 
being flipped by an automation: C* stopped seeing the SSTables, so it started 
creating new data — via read repair or repair processes.

In your case, if nodetool is reporting the extra usage, that suggests real data 
growth. What does your cfstats / tablestats say? Are you monitoring your key 
tables' data via cfstats metrics like SpaceUsedLive or SpaceUsedTotal? What is 
your snapshotting / backup process doing?

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 19, 2018, 7:01 AM -0500, horschi <hors...@gmail.com>, wrote:
> Did you check the number of files in your data folder before & after the 
> restart?
>
> I have seen cases where cassandra would keep creating sstables, which 
> disappeared on restart.
>
> regards,
> Christian
>
>
> > On Thu, Apr 19, 2018 at 12:18 PM, Fernando Neves <fernando1ne...@gmail.com> 
> > wrote:
> > > > I am facing one issue with our Cassandra cluster.
> > > >
> > > > Details: Cassandra 3.0.14, 12 nodes, 7.4TB(JBOD) disk size in each 
> > > > node, ~3.5TB used physical data in each node, ~42TB whole cluster and 
> > > > default compaction setup. This size stays the same because after the 
> > > > retention period some tables are dropped.
> > > >
> > > > Issue: Nodetool status is not showing the correct used size in the 
> > > > output. It keeps increasing the used size without limit until the node 
> > > > automatically shuts down, or until our scheduled rolling restart 
> > > > (workaround, 3 times a week). After the restart, nodetool shows the 
> > > > correct used space, but only for a few days.
> > > > Did anybody have a similar problem? Is it a bug?
> > > >
> > > > Stackoverflow: 
> > > > https://stackoverflow.com/questions/49668692/cassandra-nodetool-status-is-not-showing-correct-used-space
> > >
>


Re: where does c* store the schema?

2018-04-18 Thread Rahul Singh
Blake, you are right — although it's the system keyspace, not the system table. 
There are a few tables: schema_keyspaces, schema_columnfamilies, and 
schema_columns, which are correlated via cf_id, keyspace_name, 
columnfamily_name, and column_name.

I was thinking about the system_auth keyspace.

Jinhua,

It should catch up, but every now and then, if the changes are too great, it's 
easier to run nodetool resetlocalschema: 
https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsResetLocalSchema.html
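
A quick way to spot the divergence first (standard system tables):

```
SELECT schema_version FROM system.local;
SELECT peer, schema_version FROM system.peers;
-- if the local version differs from what peers gossip, this node's
-- schema is behind
```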

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 18, 2018, 1:17 AM -0500, Jinhua Luo <luajit...@gmail.com>, wrote:
> What happens if one node has an outdated version of the schema, and it
> launches a write request upon that schema to other nodes as a
> coordinator?
> Would the receiving nodes reject the coordinator?
>
>
>
>
> 2018-04-18 8:12 GMT+08:00 Blake Eggleston <beggles...@apple.com>:
> > Rahul, none of that is true at all.
> >
> >
> >
> > Each node stores schema locally in a non-replicated system table. Schema
> > changes are disseminated directly to live nodes (not the write path), and
> > the schema version is gossiped to other nodes. If a node misses a schema
> > update, it will figure this out when it notices that it’s local schema
> > version is behind the one being gossiped by the rest of the cluster, and
> > will pull the updated schema from the other nodes in the cluster.
> >
> >
> >
> > From: Rahul Singh <rahul.xavier.si...@gmail.com
> > Reply-To: <user@cassandra.apache.org
> > Date: Tuesday, April 17, 2018 at 4:13 PM
> > To: <user@cassandra.apache.org
> > Subject: Re: where does c* store the schema?
> >
> >
> >
> > It uses an “everywhere” replication strategy and it's recommended to do all
> > alter / create / drop statements with consistency level all — meaning it
> > wouldn’t make the change to the schema if the nodes are up.
> >
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> >
> > On Apr 17, 2018, 12:31 AM -0500, Jinhua Luo <luajit...@gmail.com>, wrote:
> >
> > Yes, I know it must be in system schema.
> >
> > But how c* replicates the user defined schema to all nodes? If it
> > applies the same RWN model to them, then what's the R and W?
> > And when a failed node comes back to the cluster, how to recover the
> > schema updates it may miss during the outage?
> >
> > 2018-04-16 17:01 GMT+08:00 DuyHai Doan <doanduy...@gmail.com>:
> >
> > There is a system_schema keyspace to store all the schema information
> >
> > https://docs.datastax.com/en/cql/3.3/cql/cql_using/useQuerySystem.html#useQuerySystem__table_bhg_1bw_4v
> >
> > On Mon, Apr 16, 2018 at 10:48 AM, Jinhua Luo <luajit...@gmail.com> wrote:
> >
> >
> > Hi All,
> >
> > Does c* use predefined keyspace/tables to store the user defined schema?
> > If so, what's the RWN of those meta schema? And what's the procedure
> > to update them?
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>


Re: multiple table directories for system_schema keyspace

2018-04-17 Thread Rahul Singh
Happens to any keyspace — not just system. If there are competing processes 
initializing the system, creating / altering new things without CL=ALL, it may 
do this. I ran into a scenario where, when permissions were flipped to a 
non-Cassandra user, the Cassandra daemon lost access to the data, so it 
reinitialized the system.
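
When it does happen, the live schema tells you which directory is current; a 
sketch for a 3.0-era cluster like the one below:

```
SELECT id FROM system_schema.tables
WHERE keyspace_name = 'my_ks' AND table_name = 'my_table';
-- the active directory is the one suffixed with this id (dashes
-- removed); the other is an orphan from the earlier incarnation
```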

--
Rahul Singh
rahul.si...@anant.us

Anant Corporation

On Apr 17, 2018, 2:25 PM -0500, John Sanda <john.sa...@gmail.com>, wrote:
> On a couple different occasions I have run into this exception at start up:
>
> Exception (org.apache.cassandra.exceptions.InvalidRequestException) 
> encountered during startup: Unknown type <type>
> org.apache.cassandra.exceptions.InvalidRequestException: Unknown type <type>
>         at 
> org.apache.cassandra.cql3.CQL3Type$Raw$RawUT.prepare(CQL3Type.java:745)
>         at 
> org.apache.cassandra.cql3.CQL3Type$Raw.prepareInternal(CQL3Type.java:533)
>         at 
> org.apache.cassandra.schema.CQLTypeParser.parse(CQLTypeParser.java:53)
>         at 
> org.apache.cassandra.schema.SchemaKeyspace.createColumnFromRow(SchemaKeyspace.java:1052)
>         at 
> org.apache.cassandra.schema.SchemaKeyspace.lambda$fetchColumns$12(SchemaKeyspace.java:1038)
>
> This was with Cassandra 3.0.12 running in Kubernetes, which means that IP 
> address changes for the Cassandra node can and will happen. Nowhere in client 
> code does the UDT get dropped. I came across 
> https://issues.apache.org/jira/browse/CASSANDRA-13739 which got me wondering 
> if this particular Cassandra node wound up with another version of the 
> system_schema.types table which did not have the UDT.
>
> In what circumstances could I end up with multiple table directories for the 
> tables in system_schema? Right now I am just guessing that I wound up with a 
> newer (or different) version of the system_schema.types table. Unfortunately, 
> I no longer have access to the environment to confirm/deny what was 
> happening. I just want to better understand so I can avoid it in the future.
>
>
> - John

