Re: How do u setup networking for Opening Solr Web Interface when on cloud?

2019-04-01 Thread Krish Donald
I have searched on the internet but did not find any link that worked for me.

Even
https://s3.amazonaws.com/quickstart-reference/datastax/latest/doc/datastax-enterprise-on-the-aws-cloud.pdf
recommends using SSH tunneling:

"DSE nodes have no public IP addresses. Access to the web consoles for Solr
or Spark can be established by using an SSH tunnel. For example, you can
access the Solr console from http://NODE_IP:8983/solr/. You can bind to a
local port with a command like the following (replacing the key and IP
values for those of your cluster): ssh -v -i $KEY_FILE -L
8983:$NODE_IP:8983 ubuntu@$OPSC_PUBLIC_IP -N The Solr console is then
accessible at http://127.0.0.1:8983/solr/. When you’re prompted to log in,
enter the user name cassandra and the password you chose. "

But I am not looking for the SSH tunneling option.

I tried to follow the link below as well:

https://forums.aws.amazon.com/thread.jspa?threadID=31406

But DSE nodes have no public IP addresses, so this did not work either.

Thanks



On Mon, Apr 1, 2019 at 12:32 PM Rahul Singh 
wrote:

> This is probably not a question for this community... but rather for
> DataStax support or the DataStax Academy Slack group. More specifically,
> this is a "how to expose Solr securely" question, which is amply answered
> on the interwebs if you search for it on Google.
>
>
> rahul.xavier.si...@gmail.com
>
> http://cassandra.link
>
> I'm speaking at #DataStaxAccelerate, the world’s premiere #ApacheCassandra
> conference, and I want to see you there! Use my code Singh50 for 50% off
> your registration. www.datastax.com/accelerate
>
>
> On Mon, Apr 1, 2019 at 12:19 PM Krish Donald  wrote:
>
>> Hi,
>>
>> We have a DSE Cassandra cluster running on AWS.
>> Now we have a requirement to enable Solr and Spark on the cluster.
>> We have Cassandra on a private data subnet which has connectivity to the
>> app layer.
>> From Cassandra, we can't open the Solr Web interface directly.
>> We tried SSH tunneling and it works, but we can't give the SSH
>> tunneling option to developers.
>>
>> We would like to create a Load Balancer and put the Cassandra nodes
>> under that load balancer, but the question is: what health check do I
>> need to give the load balancer so that it can open the Solr Web UI?
>>
>> My solution might not be perfect; please suggest any other solution if
>> you have one.
>>
>> Thanks
>>
>>


Re: Five Questions for Cassandra Users

2019-04-01 Thread Rahul Singh
Answers inline.


1.   Do the same people where you work operate the cluster and write
the code to develop the application?


No, but the operators need to know development, data modeling, and
generally how to "code" the application. (Coding is a low-level task of
assigning a code to a concept, so I don't think that's the proper verb in
these scenarios; engineering, software development, or even programming
is a better term.) It's because developers are hired a dime a dozen at
the B/C level and then replaced by D/E/F level developers as things go
on, so the data team eventually ends up being the expert on the
application and the data platform, and a "Center of Excellence" for the
developers / architects to work with on a collaborative basis.



2.   Do you have a metrics stack that allows you to see graphs of
various metrics with all the nodes displayed together?



Yes: OpsCenter, ELK, Grafana, and custom node data visualizers in Excel
(because lines and charts don't tell you everything).


3.   Do you have a log stack that allows you to see the logs for all
the nodes together?

ELK and CloudWatch.


4.   Do you regularly repair your clusters - such as by using Reaper?

Depends. Cron, Reaper, OpsCenter Repair, and now NodeSync.


5.   Do you use artificial intelligence to help manage your clusters?


Yes, I have actually made an artificial general intelligence called
Gravitron. It learns by ingesting all the news articles I aggregate about
Cassandra and the links I curate on cassandra.link into a Solr/Lucene
index, and then uses clustering to find the most popular and most
connected content. Once it does that, it summarizes the content into
human-readable form as well as interpreted bash code that gets pushed
into a "Recipe Book." As the master operator identifies scenarios in
plain English and then runs the bash commands, the machine slowly but
surely "wakes up" and starts to manage itself. It can also play Go, the
board game, beat IBM's AlphaGo at Go, and beat Donald Trump at golf while
he was cheating!



rahul.xavier.si...@gmail.com

http://cassandra.link


































Happy April Fools' Day.





On Thu, Mar 28, 2019 at 5:03 AM Kenneth Brotman
 wrote:

> I’m looking to get a better feel for how people use Cassandra in
> practice.  I thought others would benefit as well so may I ask you the
> following five questions:
>
>
>
> 1.   Do the same people where you work operate the cluster and write
> the code to develop the application?
>
>
>
> 2.   Do you have a metrics stack that allows you to see graphs of
> various metrics with all the nodes displayed together?
>
>
>
> 3.   Do you have a log stack that allows you to see the logs for all
> the nodes together?
>
>
>
> 4.   Do you regularly repair your clusters - such as by using Reaper?
>
>
>
> 5.   Do you use artificial intelligence to help manage your clusters?
>
>
>
>
>
> Thank you for taking your time to share this information!
>
>
>
> Kenneth Brotman
>


Re: How do u setup networking for Opening Solr Web Interface when on cloud?

2019-04-01 Thread Rahul Singh
This is probably not a question for this community... but rather for
DataStax support or the DataStax Academy Slack group. More specifically,
this is a "how to expose Solr securely" question, which is amply answered
on the interwebs if you search for it on Google.


rahul.xavier.si...@gmail.com

http://cassandra.link



On Mon, Apr 1, 2019 at 12:19 PM Krish Donald  wrote:

> Hi,
>
> We have a DSE Cassandra cluster running on AWS.
> Now we have a requirement to enable Solr and Spark on the cluster.
> We have Cassandra on a private data subnet which has connectivity to the
> app layer.
> From Cassandra, we can't open the Solr Web interface directly.
> We tried SSH tunneling and it works, but we can't give the SSH
> tunneling option to developers.
>
> We would like to create a Load Balancer and put the Cassandra nodes under
> that load balancer, but the question is: what health check do I need to
> give the load balancer so that it can open the Solr Web UI?
>
> My solution might not be perfect; please suggest any other solution if
> you have one.
>
> Thanks
>
>


Re: Best practices while designing backup storage system for big Cassandra cluster

2019-04-01 Thread Carl Mueller
At my current job I had to roll my own backup system. Hopefully I can get
it OSS'd at some point. Here is a (now slightly outdated) presentation:

https://docs.google.com/presentation/d/13Aps-IlQPYAa_V34ocR0E8Q4C8W2YZ6Jn5_BYGrjqFk/edit#slide=id.p

If you are struggling with the disk I/O cost of the sstable backups/copies,
note that since sstables are append-only, an incremental approach to your
backups means you only need to track a list of the current files and
upload the files that are new compared to a previous successful backup.
Your "manifest" of files for a node will need to reference the previous
backup, and you'll want to "reset" with a full backup each month.

I stole that idea from https://github.com/tbarbugli/cassandra_snapshotter.
I would have used that but we had more complex node access modes
(kubernetes, ssh through jumphosts, etc) and lots of other features needed
that weren't supported.

In AWS I use aws profiles to throttle the transfers, and parallelize across
nodes. The basic unit of a successful backup is a single node, but you'll
obviously want to track overall node success.

Note that in rack-based topologies you really only need one whole
successful rack if your RF is > # racks, and one DC.

Beware doing simultaneous flushes/snapshots across the whole cluster at
once; that might be the equivalent of a DDoS. You might want to do a
"jittered" (randomized) pre-flush of the cluster before snapshotting.

Unfortunately, the nature of a distributed system is that snapshotting all
the nodes at the precise same time is a hard problem.

I also have not used the built-in incremental backup feature of
Cassandra, which can enable more precise point-in-time backups (aside from
the unflushed data in the commitlogs).

A note on incrementals with occasional FULLs: a FULL backup each month
might take more than a day or two, especially throttled. My incrementals
originally looked up previous manifests using only "most recent", but then
the long-running FULL backups were excluded from the "chain" of
incremental backups. So I now implement a fuzzy lookup for the
incrementals that prioritizes any FULL in the last 5 days over any more
recent incremental. That way you can more safely purge old backups you
don't need, using the monthly full backups as a reset point.

On Mon, Apr 1, 2019 at 1:08 PM Alain RODRIGUEZ  wrote:

> Hello Manish,
>
> I think any disk works, as long as it is big enough. It's also better if
> it's a reliable system (some kind of redundant RAID, NAS, or storage like
> GCS or S3...). During a backup we are not looking for speed so much as
> resiliency and not harming the source cluster.
> How fast you can write to the backup storage system will more often be
> limited by what you can read from the source cluster.
> The backups have to be taken from running nodes, so it's easy to
> overload the disk (reads), the network (exporting backup data to the
> final destination), and even the CPU (if the machine handles the
> transfer).
>
> What are the best practices while designing backup storage system for big
>> Cassandra cluster?
>
>
> What is nice to have (not to say mandatory) is a system of incremental
> backups. You should not take all the data from the nodes every time, or
> you'll either harm the cluster regularly OR spend days transferring the
> data (once the amount of data grows big enough).
> I'm not speaking about Cassandra incremental snapshots, but about using
> something like AWS snapshots, or copying this behaviour programmatically
> to take (copy, link?) old SSTables from previous backups when they
> exist. This will greatly unload the cluster's work and the resources
> needed, as soon enough a substantial amount of the data should be coming
> from the backup data store itself. The problem with incremental
> snapshots is that when restoring, you have to restore multiple pieces,
> making it harder and involving a lot of compaction work.
> The "caching" technique mentioned above gives the best of the two worlds:
> - You only ever back up from the nodes the sstables you don't already
> have in your backup storage system,
> - You can always restore easily, as each backup is a full backup.
>
> It's not really a "hands-on" writing, but this should let you know about
> existing ways to do backups and the tradeoffs, I wrote this a year ago:
> http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
> .
>
> It's a complex topic, I hope some of this is helpful to you.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> Le jeu. 28 mars 2019 à 11:24, manish khandelwal <
> manishkhandelwa...@gmail.com> a écrit :
>
>> Hi
>>
>>
>>
>> I would like to know is there any guideline for selecting storage device
>> (disk type) for Cassandra backups.
>>
>>
>>
>> As per my current observation, NearLine 

Re: Best practices while designing backup storage system for big Cassandra cluster

2019-04-01 Thread Alain RODRIGUEZ
Hello Manish,

I think any disk works, as long as it is big enough. It's also better if
it's a reliable system (some kind of redundant RAID, NAS, or storage like
GCS or S3...). During a backup we are not looking for speed so much as
resiliency and not harming the source cluster.
How fast you can write to the backup storage system will more often be
limited by what you can read from the source cluster.
The backups have to be taken from running nodes, so it's easy to overload
the disk (reads), the network (exporting backup data to the final
destination), and even the CPU (if the machine handles the transfer).

What are the best practices while designing backup storage system for big
> Cassandra cluster?


What is nice to have (not to say mandatory) is a system of incremental
backups. You should not take all the data from the nodes every time, or
you'll either harm the cluster regularly OR spend days transferring the
data (once the amount of data grows big enough).
I'm not speaking about Cassandra incremental snapshots, but about using
something like AWS snapshots, or copying this behaviour programmatically
to take (copy, link?) old SSTables from previous backups when they exist.
This will greatly unload the cluster's work and the resources needed, as
soon enough a substantial amount of the data should be coming from the
backup data store itself. The problem with incremental snapshots is that
when restoring, you have to restore multiple pieces, making it harder and
involving a lot of compaction work.
The "caching" technique mentioned above gives the best of the two worlds:
- You only ever back up from the nodes the sstables you don't already have
in your backup storage system,
- You can always restore easily, as each backup is a full backup.

It's not really a "hands-on" writing, but this should let you know about
existing ways to do backups and the tradeoffs, I wrote this a year ago:
http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
.

It's a complex topic, I hope some of this is helpful to you.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le jeu. 28 mars 2019 à 11:24, manish khandelwal <
manishkhandelwa...@gmail.com> a écrit :

> Hi
>
>
>
> I would like to know is there any guideline for selecting storage device
> (disk type) for Cassandra backups.
>
>
>
> As per my current observation, NearLine (NL) disks on SAN slow down
> significantly while copying backup files (taking a full backup) from all
> nodes simultaneously. Will using SSD disks on SAN help us in this regard?
>
> Apart from using SSD disks, what are the alternative approaches to make
> my backup process faster?
>
> What are the best practices while designing backup storage system for big
> Cassandra cluster?
>
>
> Regards
>
> Manish
>


How do u setup networking for Opening Solr Web Interface when on cloud?

2019-04-01 Thread Krish Donald
Hi,

We have a DSE Cassandra cluster running on AWS.
Now we have a requirement to enable Solr and Spark on the cluster.
We have Cassandra on a private data subnet which has connectivity to the
app layer.
From Cassandra, we can't open the Solr Web interface directly.
We tried SSH tunneling and it works, but we can't give the SSH
tunneling option to developers.

We would like to create a Load Balancer and put the Cassandra nodes under
that load balancer, but the question is: what health check do I need to
give the load balancer so that it can open the Solr Web UI?

My solution might not be perfect; please suggest any other solution if
you have one.

Thanks