Re: Using K8s to Manage Cassandra in Production

2018-05-23 Thread vincent gromakowski
Thanks! Do you have some pointers to the available features? I am more
worried about the lack of custom controller integration, for instance the code
generator...

2018-05-23 17:17 GMT+02:00 Ben Bromhead <b...@instaclustr.com>:

> The official Kubernetes Java driver is actually pretty feature-complete,
> if not exactly idiomatic Java...  it's only missing full examples to get it
> to Gold compatibility level, IIRC.
>
> A few reasons we went down the Java path:
>
>- Cassandra community engagement was the primary concern. If you are a
>developer in the Cassandra community you have a base level of Java
>knowledge, so it means if you want to work on the Kubernetes operator you
>only have to learn 1 thing, Kubernetes. If the operator was in Go, you
>would then have two things to learn, Go and Kubernetes :)
>- We actually wrote an initial PoC in Go (based off the etcd operator,
>you can find it here https://github.com/benbromhead/cassandra-operator-old ),
>but because it was in Go we ended up making architectural decisions simply
>because Go doesn't do JMX, so it felt like we were just fighting different
>ecosystems just to be part of the cool group.
>
> Some other less important points weighed the decision in Java's favour:
>
>- The folk at Instaclustr all know Java, and are productive in it from
>day 1. Go is fun and relatively simple, but not our forte.
>- Mature package management and generics in Java, versus the inability
>to write DRY code and a million if err != nil statements in Go (:
>- Some other awesome operators/controllers are written in JVM-based
>languages. The Spark Kubernetes resource manager (which is a k8s controller)
>is written in Scala.
>
>
> On Wed, May 23, 2018 at 10:04 AM vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Why did you choose Java for the operator implementation when everybody
>> seems to use the Go client (probably for its greater functionality)?
>>
>> 2018-05-23 15:39 GMT+02:00 Ben Bromhead <b...@instaclustr.com>:
>>
>>> You can get a good way with StatefulSets, but as Tom mentioned there are
>>> still some issues with this, particularly around scaling up and down.
>>>
>>> We are working on an Operator for Apache Cassandra, you can find it here
>>> https://github.com/instaclustr/cassandra-operator. This is a joint
>>> project between Instaclustr, Pivotal and a few other folk.
>>>
>>> Currently it's a work in progress, but we would love any or all early
>>> feedback/PRs/issues etc. Our first GA release will target the following
>>> capabilities:
>>>
>>>- Safe scaling up and down (including decommissioning)
>>>- Backup/restore workflow (snapshots only initially)
>>>- Built in prometheus integration and discovery
>>>
>>> Other features like repair, better PV support, maybe even a nice
>>> dashboard will be on the way.
>>>
>>>
>>> On Wed, May 23, 2018 at 7:35 AM Tom Petracca <tpetra...@palantir.com>
>>> wrote:
>>>
>>>> Using a statefulset should get you pretty far, though will likely be
>>>> less effective than a coreos-style “operator”. Some random points:
>>>>
>>>>- For scale-up: a node shouldn’t report “ready” until it’s in the
>>>>NORMAL state; this will prevent multiple nodes from bootstrapping at 
>>>> once.
>>>>- For scale-down: as of now there isn’t a mechanism to know if a
>>>>pod is getting decommissioned because you’ve permanently lowered replica
>>>>count, or because it’s just getting bounced/re-scheduled, thus knowing
>>>>whether or not to decommission is basically impossible. Relevant issue:
>>>>kubernetes/kubernetes#1462
>>>><https://github.com/kubernetes/kubernetes/issues/1462>
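Tom's first point can be made concrete: the readiness check should only pass once the local node reports Up/Normal (UN). Below is a minimal, hypothetical sketch of that logic in Python; the status text would come from running `nodetool status` inside the container, and the parsing assumes nodetool's usual column layout.

```python
# Hedged sketch of the readiness logic described above: a pod only reports
# "ready" once its node shows state UN (Up/Normal) in the ring. The status
# text is passed in as a string here so the parsing is self-contained; a
# real probe would capture it from `nodetool status`.

def is_node_ready(nodetool_status: str, node_address: str) -> bool:
    """Return True only if node_address is listed with state UN.

    Joining (UJ) or down (DN) nodes count as not ready, which keeps
    several pods from bootstrapping at once.
    """
    for line in nodetool_status.splitlines():
        parts = line.split()
        # Status lines look roughly like: "UN  10.0.0.1  120.5 GiB  256  ..."
        if len(parts) >= 2 and parts[1] == node_address:
            return parts[0] == "UN"
    return False  # node not visible in the ring yet


if __name__ == "__main__":
    sample = (
        "UN  10.0.0.1  120.5 GiB  256  66.2%  aaaa-1111  rack1\n"
        "UJ  10.0.0.2  1.2 GiB    256  ?      bbbb-2222  rack1\n"
    )
    print(is_node_ready(sample, "10.0.0.1"))  # True: Up/Normal
    print(is_node_ready(sample, "10.0.0.2"))  # False: still joining
```

Wired into a Kubernetes readinessProbe, a failing result from this check keeps the pod out of "ready" until bootstrap finishes.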
>>>>
>>>>
>>>>
>>>> *From: *Pradeep Chhetri <prad...@stashaway.com>
>>>> *Reply-To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> *Date: *Friday, May 18, 2018 at 10:20 AM
>>>> *To: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>>>> *Subject: *Re: Using K8s to Manage Cassandra in Production
>>>>
>>>>
>>>>
>>>> Hello Hassaan,
>>>>
>>>>
>>>>
>>>> We use cassandra helm chart[0] for deploying cassandra over kubernetes
>>>> in production. We have around 200GB cas data. It works really well. You can
>>>> scale up nodes 

Re: Using K8s to Manage Cassandra in Production

2018-05-23 Thread vincent gromakowski
Why did you choose Java for the operator implementation when everybody
seems to use the Go client (probably for its greater functionality)?

2018-05-23 15:39 GMT+02:00 Ben Bromhead :

> You can get a good way with StatefulSets, but as Tom mentioned there are
> still some issues with this, particularly around scaling up and down.
>
> We are working on an Operator for Apache Cassandra, you can find it here
> https://github.com/instaclustr/cassandra-operator. This is a joint
> project between Instaclustr, Pivotal and a few other folk.
>
> Currently it's a work in progress, but we would love any or all early
> feedback/PRs/issues etc. Our first GA release will target the following
> capabilities:
>
>- Safe scaling up and down (including decommissioning)
>- Backup/restore workflow (snapshots only initially)
>- Built in prometheus integration and discovery
>
> Other features like repair, better PV support, maybe even a nice dashboard
> will be on the way.
>
>
> On Wed, May 23, 2018 at 7:35 AM Tom Petracca 
> wrote:
>
>> Using a statefulset should get you pretty far, though will likely be less
>> effective than a coreos-style “operator”. Some random points:
>>
>>- For scale-up: a node shouldn’t report “ready” until it’s in the
>>NORMAL state; this will prevent multiple nodes from bootstrapping at once.
>>- For scale-down: as of now there isn’t a mechanism to know if a pod
>>is getting decommissioned because you’ve permanently lowered replica 
>> count,
>>or because it’s just getting bounced/re-scheduled, thus knowing whether or
>>not to decommission is basically impossible. Relevant issue:
>>kubernetes/kubernetes#1462
>>
>>
>>
>>
>> *From: *Pradeep Chhetri 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Friday, May 18, 2018 at 10:20 AM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: Using K8s to Manage Cassandra in Production
>>
>>
>>
>> Hello Hassaan,
>>
>>
>>
>> We use the cassandra helm chart[0] for deploying cassandra over kubernetes in
>> production. We have around 200 GB of C* data. It works really well. You can
>> scale up nodes easily (I haven't tested scaling down).
>>
>>
>>
>> I would say that if you are worried about running cassandra over k8s in
>> production, maybe you should first try setting it up for your
>> staging/preproduction environment and gain confidence over time.
>>
>>
>>
>> I have tested situations where I have killed the host running the cassandra
>> container and have seen that the container moves to a different node and
>> joins the cluster properly. So from my experience it's pretty good. No issues yet.
>>
>>
>>
>> [0]: https://github.com/kubernetes/charts/tree/master/incubator/cassandra
>> 
>>
>>
>>
>>
>>
>> Regards,
>>
>> Pradeep
>>
>>
>>
>> On Fri, May 18, 2018 at 1:01 PM, Павел Сапежко 
>> wrote:
>>
>> Hi, Hassaan! For example, we are using C* in k8s in production for our
>> video surveillance system. Moreover, we are using Ceph RBD as our storage
>> for Cassandra. Today we have 8 C* nodes, each managing 2 TB of data.
>>
>>
>>
>> On Fri, May 18, 2018 at 9:27 AM Hassaan Pasha  wrote:
>>
>> Hi,
>>
>>
>>
>> I am trying to craft a deployment strategy for deploying and maintaining
>> a C* cluster. I was wondering if there are actual production deployments of
>> C* using K8s as the orchestration layer.
>>
>>
>>
>> I have been given the impression that K8s managing a C* cluster can be a
>> recipe for disaster, especially if you aren't well versed with the
>> intricacies of a scale-up/down event. I know of use cases where people are
>> using Mesos or a custom tool built with Terraform/Chef etc. to run their
>> production clusters, but have yet to find a real K8s use case.
>>
>>
>>
>> *Questions?*
>>
>> Is K8s a reasonable choice for managing a production C* cluster?
>>
>> Are there documented use cases for this?
>>
>>
>>
>> Any help would be greatly appreciated.
>>
>>
>>
>> --
>>
>> Regards,
>>
>>
>>
>> *Hassaan Pasha*
>>
>> --
>>
>> Regards,
>>
>> Pavel Sapezhko
>>
>>
>>
> --
> Ben Bromhead
> CTO | Instaclustr 
> +1 650 284 9692
> Reliability at Scale
> Cassandra, Spark, Elasticsearch on AWS, Azure, GCP and Softlayer
>


Re: Solr/DSE Spark

2018-04-12 Thread vincent gromakowski
Best practice is to use a dedicated DC for analytics, separated from the hot
DC.

On Thu, 12 Apr 2018 at 15:45, sha p wrote:

> Got it.
> Thank you so much for your detailed explanation.
>
> Regards,
> Shyam
>
> On Thu, 12 Apr 2018, 17:37 Evelyn Smith,  wrote:
>
>> Cassandra tends to be used in a lot of web applications. Its loads are
>> more natural and evenly distributed, like people logging on throughout the
>> day. And the people operating it tend to be latency-sensitive.
>>
>> Spark, on the other hand, will try to complete its tasks as quickly as
>> possible. This might mean bulk reading from Cassandra at 10 times the
>> usual operations load, but for only say 5 minutes every half hour (however
>> long it takes to read in the data for a job, and whenever that job is run).
>> During those 5 minutes your normal operations work (customers)
>> is going to experience a lot of latency.
>>
>> This even happens with streaming jobs: every time Spark goes to interact
>> with Cassandra it does so very quickly, hammers it for reads, and then does
>> its own stuff until it needs to write things out. This might equate to
>> intermittent latency spikes.
>>
>> In theory, you can throttle your reads and writes but I don’t know much
>> about this and don’t see people actually doing it.
>>
>> Regards,
>> Evelyn.
>>
>> On 12 Apr 2018, at 4:30 pm, sha p  wrote:
>>
>> Evelyn,
>> Can you please elaborate on below
>> Spark is notorious for causing latency spikes in Cassandra, which is not
>> great if you are sensitive to that.
>>
>>
>> On Thu, 12 Apr 2018, 10:46 Evelyn Smith,  wrote:
>>
>>> Are you building a search engine -> Solr
>>> Are you building an analytics function -> Spark
>>>
>>> I feel they are used in significantly different use cases, what are you
>>> trying to build?
>>>
>>> If it’s an analytics functionality that’s separate from your operations
>>> functionality, I’d build it in its own DC. Spark is notorious for causing
>>> latency spikes in Cassandra, which is not great if you are sensitive to
>>> that.
>>>
>>> Regards,
>>> Evelyn.
>>>
>>> On 12 Apr 2018, at 6:55 am, kooljava2 
>>> wrote:
>>>
>>> Hello,
>>>
>>> We are exploring configuring Solr/Spark. Wanted to get input on this.
>>> 1) How do we decide which one to use?
>>> 2) Do we run this on a DC where there is less workload?
>>>
>>> Any other suggestion or comments are appreciated.
>>>
>>> Thank you.
>>>
>>>
>>>
>>


Re: What kind of Automation you have for Cassandra related operations on AWS ?

2018-02-09 Thread vincent gromakowski
It will clearly follow your colleagues' approach with the PostgreSQL operator:
https://github.com/zalando-incubator/postgres-operator

Just watch my repo for a first working beta version in the coming weeks:
https://github.com/vgkowski/cassandra-operator


2018-02-09 15:20 GMT+01:00 Oleksandr Shulgin <oleksandr.shul...@zalando.de>:

> On Fri, Feb 9, 2018 at 1:01 PM, vincent gromakowski <
> vincent.gromakow...@gmail.com> wrote:
>
>> Working on a Kubernetes operator for Cassandra (Alpha stage...)
>>
>
> I would love to learn more about your approach.  Do you have anything to
> show already?  Design docs / prototype?
>
> --
> Alex
>
>


Re: What kind of Automation you have for Cassandra related operations on AWS ?

2018-02-09 Thread vincent gromakowski
Working on a Kubernetes operator for Cassandra (Alpha stage...)

On 9 Feb 2018 at 12:56 PM, "Oleksandr Shulgin"
wrote:

> On Fri, Feb 9, 2018 at 12:46 AM, Krish Donald 
> wrote:
>
>> Hi All,
>>
>> What kind of Automation you have for Cassandra related operations on AWS
>> like restacking, restart of the cluster , changing cassandra.yaml
>> parameters etc ?
>>
>
> We wrote some scripts customized for Zalando's STUPS platform:
> https://github.com/zalando-stups/planb-cassandra  (Warning! messy Python
> inside)
>
> We deploy EBS-backed instances with AWS EC2 auto-recovery enabled.
> Cassandra runs inside Docker on the EC2 hosts.
>
> The EBS setup allows us to perform rolling restarts / binary updates
> without streaming.
>
> Updating configuration parameters is a bit tricky since there are many
> places where different stuff is configured: cassandra-env.sh, jvm.options,
> cassandra.yaml and environment variables.  We don't have a comprehensive
> answer to that yet.
>
> Cheers,
> --
> Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
> 127-59-707 <+49%20176%2012759707>
>
>


Re: Pluggable throttling of read and write queries

2017-02-20 Thread vincent gromakowski
Aren't you using the Mesos Cassandra framework to manage your multiple
clusters? (I saw a presentation at the Cassandra summit.)
What's wrong with your current Mesos approach?
I also think it's better to split a large cluster into smaller ones,
unless you also manage the client layer that queries Cassandra and can put
some backpressure or rate limiting in it.
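The client-side backpressure mentioned here can be as simple as a token bucket in front of each tenant's query path. A hypothetical sketch follows; the class and its sizing are illustrative and not part of any Cassandra driver.

```python
import time

# Illustrative token bucket for client-layer rate limiting on a shared
# cluster: each tenant consults its bucket before sending a query, so a
# noisy neighbor is throttled instead of starving everyone else.

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Take `cost` tokens if available; False means the caller should
        back off or queue instead of hitting the cluster."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=100.0, capacity=10.0)
    sent = sum(bucket.try_acquire() for _ in range(50))
    print(f"sent {sent} of 50 queries immediately")  # roughly the burst size
```

The same idea scales to per-keyspace or per-tenant buckets if the client layer knows which use case each query belongs to.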

On 21 Feb 2017 at 2:46 AM, "Edward Capriolo"
wrote:

> Older versions had a request scheduler api.
>
> On Monday, February 20, 2017, Ben Slater 
> wrote:
>
>> We’ve actually had several customers where we’ve done the opposite -
>> split large clusters apart to separate use cases. We found that this
>> allowed us to better align hardware with use case requirements (for example
>> using AWS c3.2xlarge for very hot data at low latency, m4.xlarge for more
>> general-purpose data). We can also tune JVM settings, etc. to meet those use
>> cases.
>>
>> Cheers
>> Ben
>>
>> On Mon, 20 Feb 2017 at 22:21 Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> On Sat, Feb 18, 2017 at 3:12 AM, Abhishek Verma  wrote:
>>>
 Cassandra is being used on a large scale at Uber. We usually create
 dedicated clusters for each of our internal use cases, however that is
 difficult to scale and manage.

 We are investigating the approach of using a single shared cluster with
 100s of nodes and handle 10s to 100s of different use cases for different
 products in the same cluster. We can define different keyspaces for each of
 them, but that does not help in case of noisy neighbors.

 Does anybody in the community have similar large shared clusters and/or
 face noisy neighbor issues?

>>>
>>> Hi,
>>>
>>> We've never tried this approach, and given my limited experience I would
>>> find it a terrible idea from the perspective of maintenance (remember the
>>> old saying about eggs and baskets?).
>>>
>>> What potential benefits do you see?
>>>
>>> Regards,
>>> --
>>> Alex
>>>
>>> --
>> 
>> Ben Slater
>> Chief Product Officer
>> Instaclustr: Cassandra + Spark - Managed | Consulting | Support
>> +61 437 929 798 <+61%20437%20929%20798>
>>
>
>
> --
> Sorry this was sent from mobile. Will do less grammar and spell check than
> usual.
>


Re: cassandra user request log

2017-02-10 Thread vincent gromakowski
tx

2017-02-10 10:01 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> you could write a custom trigger that logs access to specific CFs. But be
> aware that this may have a big performance impact.
>
> 2017-02-10 9:58 GMT+01:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
>> GDPR compliance... we need to trace user activity on personal data. Maybe
>> there is another way?
>>
>> 2017-02-10 9:46 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:
>>
>>> On a cluster with just a little bit of load, that would cause zillions of
>>> petabytes of logs (just roughly ;)). I don't think this is viable.
>>> There are many, many JMX metrics at an aggregated level, but none per
>>> authenticated user.
>>> What exactly do you want to find out? Is it for debugging purposes?
>>>
>>>
>>> 2017-02-10 9:42 GMT+01:00 vincent gromakowski <
>>> vincent.gromakow...@gmail.com>:
>>>
>>>> Hi all,
>>>> Is there any way to trace user activity at the server level to see
>>>> which user is accessing which data? Do you think it would be simple to
>>>> implement?
>>>> Tx
>>>>
>>>
>>>
>>>
>>> --
>>> Benjamin Roth
>>> Prokurist
>>>
>>> Jaumo GmbH · www.jaumo.com
>>> Wehrstraße 46 · 73035 Göppingen · Germany
>>> Phone +49 7161 304880-6 <+49%207161%203048806> · Fax +49 7161 304880-1
>>> <+49%207161%203048801>
>>> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>>>
>>
>>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 <+49%207161%203048806> · Fax +49 7161 304880-1
> <+49%207161%203048801>
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>


Re: cassandra user request log

2017-02-10 Thread vincent gromakowski
GDPR compliance... we need to trace user activity on personal data. Maybe
there is another way?

2017-02-10 9:46 GMT+01:00 Benjamin Roth <benjamin.r...@jaumo.com>:

> On a cluster with just a little bit of load, that would cause zillions of
> petabytes of logs (just roughly ;)). I don't think this is viable.
> There are many, many JMX metrics at an aggregated level, but none per
> authenticated user.
> What exactly do you want to find out? Is it for debugging purposes?
>
>
> 2017-02-10 9:42 GMT+01:00 vincent gromakowski <
> vincent.gromakow...@gmail.com>:
>
>> Hi all,
>> Is there any way to trace user activity at the server level to see which
>> user is accessing which data? Do you think it would be simple to implement?
>> Tx
>>
>
>
>
> --
> Benjamin Roth
> Prokurist
>
> Jaumo GmbH · www.jaumo.com
> Wehrstraße 46 · 73035 Göppingen · Germany
> Phone +49 7161 304880-6 <+49%207161%203048806> · Fax +49 7161 304880-1
> <+49%207161%203048801>
> AG Ulm · HRB 731058 · Managing Director: Jens Kammerer
>


cassandra user request log

2017-02-10 Thread vincent gromakowski
Hi all,
Is there any way to trace user activity at the server level to see which
user is accessing which data? Do you think it would be simple to implement?
Tx


Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread vincent gromakowski
You can also have a look at https://github.com/strapdata/elassandra


2017-01-31 9:50 GMT+01:00 vincent gromakowski <vincent.gromakow...@gmail.com
>:

> The problem with ad hoc queries on Cassandra (with Spark or not) is the
> partition model of Cassandra, which needs to be respected to avoid full-scan
> queries (the link you mentioned explains all of them). With FiloDB, which
> works on Cassandra, you can push down predicates on the partition key and
> segment key in an arbitrary order, resulting in fewer full-scan
> queries. Another advantage is the computed columns, which can also prune
> partitions or segments and so reduce the reads based on a subpart of the key
> (like a time range of 2 hours or 10 min).
> Anyway, it's not magic, and my personal analysis doesn't position FiloDB as a
> fully ad hoc query solution, but it's largely better than pure Cassandra. You
> can easily have pushdown predicates on any combination of 1 to 3-5 columns,
> depending on the dataset, compared to pure Cassandra where you need to
> provide a value for the first key to push down a predicate on the second key,
> then the third key...
>
> 2017-01-31 8:56 GMT+01:00 Yu, John <john...@sandc.com>:
>
>> Thanks. I thought you had given up Lucene for Spark, but it seems your
>> Lucene still works.
>>
>>
>>
>> Spark also has a Cassandra connector, and my questions were more towards
>> that.
>>
>> From
>> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
>> it seems there’re limitations on how much one can select the data to support
>> ad hoc queries. It seems mostly limited to clustering columns. Maybe in
>> other cases, it would result in full scan, but that’s going to be very slow.
>>
>>
>>
>> Regards,
>>
>> John
>>
>>
>>
>> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
>> *Sent:* Monday, January 30, 2017 10:20 PM
>>
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>>
>>
>>
>> Hi,
>>
>> *Are you using the DataStax connector as well? *
>>
>> Yes, we used it to query on lucene index.
>>
>>
>>
>> *Does it support querying against any column well (not just clustering
>> columns)?*
>>
>> Yes it does. We used lucene particularly for this purpose.
>>
>> ( You can use :
>>
>> 1. https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>>
>> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>>
>> for more details)
>>
>>
>>
>> *I’m wondering how it could build the index around them “on-the-fly”*
>>
>> You can build indexes at run time, but it takes time (it took a lot of time
>> on our cluster; plus, CPU utilization went through the roof).
>>
>>
>>
>> *did you use Spark for the full set of data or just partial*
>>
>> We weren't allowed to install spark ( tech decision)
>>
>> Some tech discussions going around for the bulk job ecosystem.
>>
>>
>>
>> Hence as a work around, we used a faster scan utility.
>>
>> For all the adhoc purposes/scripts, you could do a full scan.
>>
>>
>>
>> I hope it helps.
>>
>>
>>
>> Regards
>>
>>
>>
>>
>>
>> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John <john...@sandc.com> wrote:
>>
>> A follow up question is: did you use Spark for the full set of data or
>> just partial? In our case, I feel we need all the data to support ad hoc
>> queries (with multiple conditional filters).
>>
>>
>>
>> Thanks,
>>
>> John
>>
>>
>>
>> *From:* Yu, John [mailto:john...@sandc.com]
>> *Sent:* Monday, January 30, 2017 12:04 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>>
>>
>>
>> Thanks for the input! Are you using the DataStax connector as well? Does
>> it support querying against any column well (not just clustering columns)?
>> I’m wondering how it could build the index around them “on-the-fly”.
>>
>>
>>
>> Regards,
>>
>> John
>>
>>
>>
>> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
>> <sidd.verma29.l...@gmail.com>]
>> *Sent:* Friday, January 27, 2017 12:15 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>>
>>
>>
>> Hi
>>
>> We used lucene stratio plugin with C*3

Re: [External] Re: Cassandra ad hoc search options

2017-01-31 Thread vincent gromakowski
The problem with ad hoc queries on Cassandra (with Spark or not) is the
partition model of Cassandra, which needs to be respected to avoid full-scan
queries (the link you mentioned explains all of them). With FiloDB, which
works on Cassandra, you can push down predicates on the partition key and
segment key in an arbitrary order, resulting in fewer full-scan
queries. Another advantage is the computed columns, which can also prune
partitions or segments and so reduce the reads based on a subpart of the key
(like a time range of 2 hours or 10 min).
Anyway, it's not magic, and my personal analysis doesn't position FiloDB as a
fully ad hoc query solution, but it's largely better than pure Cassandra. You
can easily have pushdown predicates on any combination of 1 to 3-5 columns,
depending on the dataset, compared to pure Cassandra where you need to
provide a value for the first key to push down a predicate on the second key,
then the third key...
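The restriction described in the last sentence (you must fix the first key before a predicate on the second key can be pushed down) amounts to a prefix rule over the clustering order. A simplified, hypothetical illustration, covering equality predicates only and not actual FiloDB or Cassandra code:

```python
# Simplified illustration of Cassandra's pushdown rule: equality
# predicates on clustering columns are only usable if they form a prefix
# of the clustering order; otherwise the query degrades to a scan
# (ALLOW FILTERING). FiloDB's segment keys relax this ordering constraint.

def can_push_down(clustering_order, restricted):
    """True if `restricted` (clustering columns with an equality predicate)
    is a prefix of `clustering_order`."""
    prefix_len = 0
    for col in clustering_order:
        if col in restricted:
            prefix_len += 1
        else:
            break
    return prefix_len == len(restricted)


if __name__ == "__main__":
    order = ["day", "hour", "sensor_id"]  # hypothetical table layout
    print(can_push_down(order, {"day"}))          # True: first key fixed
    print(can_push_down(order, {"day", "hour"}))  # True: prefix
    print(can_push_down(order, {"hour"}))         # False: first key missing
```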

2017-01-31 8:56 GMT+01:00 Yu, John :

> Thanks. I thought you had given up Lucene for Spark, but it seems your
> Lucene still works.
>
>
>
> Spark also has a Cassandra connector, and my questions were more towards
> that.
>
> From
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/3_selection.md,
> it seems there’re limitations on how much one can select the data to support
> ad hoc queries. It seems mostly limited to clustering columns. Maybe in other
> cases, it would result in full scan, but that’s going to be very slow.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com]
> *Sent:* Monday, January 30, 2017 10:20 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi,
>
> *Are you using the DataStax connector as well? *
>
> Yes, we used it to query on lucene index.
>
>
>
> *Does it support querying against any column well (not just clustering
> columns)?*
>
> Yes it does. We used lucene particularly for this purpose.
>
> ( You can use :
>
> 1. https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
>
> 2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
>
> for more details)
>
>
>
> *I’m wondering how it could build the index around them “on-the-fly”*
>
> You can build indexes at run time, but it takes time (it took a lot of time on
> our cluster; plus, CPU utilization went through the roof).
>
>
>
> *did you use Spark for the full set of data or just partial*
>
> We weren't allowed to install spark ( tech decision)
>
> Some tech discussions going around for the bulk job ecosystem.
>
>
>
> Hence as a work around, we used a faster scan utility.
>
> For all the adhoc purposes/scripts, you could do a full scan.
>
>
>
> I hope it helps.
>
>
>
> Regards
>
>
>
>
>
> On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:
>
> A follow up question is: did you use Spark for the full set of data or
> just partial? In our case, I feel we need all the data to support ad hoc
> queries (with multiple conditional filters).
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* Yu, John [mailto:john...@sandc.com]
> *Sent:* Monday, January 30, 2017 12:04 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>
>
>
> Thanks for the input! Are you using the DataStax connector as well? Does
> it support querying against any column well (not just clustering columns)?
> I’m wondering how it could build the index around them “on-the-fly”.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
> ]
> *Sent:* Friday, January 27, 2017 12:15 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi
>
> We used lucene stratio plugin with C*3.0.3
>
>
>
> Helped to solve a lot of read patterns. Served well for prefix queries.
>
> But created problems as repairs failed repeatedly.
>
> We might have used it suboptimally, not sure.
>
>
>
> Later, we had to do away with it, and tried to serve most of the read
> patterns with materialised views. (currently C*3.0.9)
>
>
>
> Currently, for adhoc querries, we use spark or full scan.
>
>
>
> Regards,
>
>
>
> On Fri, Jan 27, 2017 at 1:03 PM, Yu, John  wrote:
>
> Thanks a lot. Mind sharing a couple of points on where you feel it’s better
> than the alternatives?
>
>
>
> Regards,
>
> John
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Thursday, January 26, 2017 2:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* [External] Re: Cassandra ad hoc search options
>
>
>
> > With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS?
>
>
>
> Your best options are Spark w/ the DataStax connector or Presto.
> Cassandra isn't built for ad-hoc queries so you need to use other tools to
> make it work.
>
>
>
> On Thu, Jan 26, 2017 at 2:22 PM Yu, John  wrote:
>
> Hi All,
>
>
>
> Hope 

Re: [External] Re: Cassandra ad hoc search options

2017-01-30 Thread vincent gromakowski
I gave Spark + FiloDB a try and it's very interesting for ad hoc queries.

On 31 Jan 2017 at 7:20 AM, "siddharth verma"
wrote:

Hi,
*Are you using the DataStax connector as well? *
Yes, we used it to query on lucene index.

*Does it support querying against any column well (not just clustering
columns)?*
Yes it does. We used lucene particularly for this purpose.
( You can use :
1. https://github.com/Stratio/cassandra-lucene-index/blob/branch-3.0.10/doc/documentation.rst#searching
2. https://www.youtube.com/watch?v=Hg5s-hXy_-M
for more details)

*I’m wondering how it could build the index around them “on-the-fly”*
You can build indexes at run time, but it takes time (it took a lot of time on
our cluster; plus, CPU utilization went through the roof).

*did you use Spark for the full set of data or just partial*
We weren't allowed to install spark ( tech decision)
Some tech discussions going around for the bulk job ecosystem.

Hence as a work around, we used a faster scan utility.
For all the adhoc purposes/scripts, you could do a full scan.

I hope it helps.

Regards


On Tue, Jan 31, 2017 at 4:11 AM, Yu, John  wrote:

> A follow up question is: did you use Spark for the full set of data or
> just partial? In our case, I feel we need all the data to support ad hoc
> queries (with multiple conditional filters).
>
>
>
> Thanks,
>
> John
>
>
>
> *From:* Yu, John [mailto:john...@sandc.com]
> *Sent:* Monday, January 30, 2017 12:04 AM
> *To:* user@cassandra.apache.org
> *Subject:* RE: [External] Re: Cassandra ad hoc search options
>
>
>
> Thanks for the input! Are you using the DataStax connector as well? Does
> it support querying against any column well (not just clustering columns)?
> I’m wondering how it could build the index around them “on-the-fly”.
>
>
>
> Regards,
>
> John
>
>
>
> *From:* siddharth verma [mailto:sidd.verma29.l...@gmail.com
> ]
> *Sent:* Friday, January 27, 2017 12:15 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: [External] Re: Cassandra ad hoc search options
>
>
>
> Hi
>
> We used lucene stratio plugin with C*3.0.3
>
>
>
> Helped to solve a lot of read patterns. Served well for prefix queries.
>
> But created problems as repairs failed repeatedly.
>
> We might have used it suboptimally, not sure.
>
>
>
> Later, we had to do away with it, and tried to serve most of the read
> patterns with materialised views. (currently C*3.0.9)
>
>
>
> Currently, for adhoc querries, we use spark or full scan.
>
>
>
> Regards,
>
>
>
> On Fri, Jan 27, 2017 at 1:03 PM, Yu, John  wrote:
>
> Thanks a lot. Mind sharing a couple of points on where you feel it’s better
> than the alternatives?
>
>
>
> Regards,
>
> John
>
>
>
> *From:* Jonathan Haddad [mailto:j...@jonhaddad.com]
> *Sent:* Thursday, January 26, 2017 2:33 PM
> *To:* user@cassandra.apache.org
> *Subject:* [External] Re: Cassandra ad hoc search options
>
>
>
> > With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS?
>
>
>
> Your best options are Spark w/ the DataStax connector or Presto.
> Cassandra isn't built for ad-hoc queries so you need to use other tools to
> make it work.
>
>
>
> On Thu, Jan 26, 2017 at 2:22 PM Yu, John  wrote:
>
> Hi All,
>
>
>
> Hope I can get some help here. We’re using Cassandra for services, and
> recently we’re adding UI support.
>
> With Cassandra, what are the options for ad hoc query/search similar to
> RDBMS? We love the features of Cassandra but it seems it’s a known
> “weakness” that it doesn’t come with strong support of indexing and ad hoc
> queries. There’s been some recent development with SASI as part of secondary
> indexes. However, I heard in a video that it should not be used
> extensively.
>
>
>
> Does anyone have much experience with SASI? How does it compare to the Lucene
> plugin?
>
> What is the direction of Apache Cassandra in the search area?
>
>
>
> We’re also looking into Solr or Elasticsearch integration, but it seems it
> might take more effort, and possibly involve data duplication.
>
> For Solr, we don’t have DSE.
>
> Sorry if this has been asked before, but I haven’t seen a more complete
> answer.
>
>
>
> Thanks!
>
> John
> --
>
>
>
>
>
>
> --
>
> Siddharth Verma
>
> (Visit 
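
For readers wondering what SASI usage looks like in practice: it is created with CREATE CUSTOM INDEX and enables LIKE-style queries on regular columns (available from Cassandra 3.4). A minimal sketch — the table and column names here are hypothetical, not taken from the thread:

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE posts (
    user_id uuid,
    post_id timeuuid,
    body    text,
    PRIMARY KEY (user_id, post_id)
);

-- SASI index with a tokenizing analyzer, enabling substring/term search:
CREATE CUSTOM INDEX posts_body_idx ON posts (body)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
  'mode': 'CONTAINS',
  'analyzed': 'true',
  'analyzer_class':
    'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer'
};

-- Queries can then use LIKE:
-- SELECT * FROM posts WHERE body LIKE '%cassandra%';
```

Like any Cassandra secondary index, such queries fan out across nodes, which matches the advice in the thread that SASI should not be used extensively as a general search engine.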

Re: are there any free Cassandra -> ElasticSearch connector / plugin ?

2016-10-13 Thread vincent gromakowski
Elassandra
https://github.com/vroyer/elassandra

On Oct 14, 2016 at 12:02 AM, "Eric Ho"  wrote:

> I don't want to change my code to write into C* and then to ES.
> So, I'm looking for some sort of a sync tool that will sync my C* table
> into ES and it should be smart enough to avoid duplicates or gaps.
> Is there such a tool / plugin?
> I'm using stock Apache Cassandra 3.7.
> I know that some premium Cassandra distributions have ES built in or
> integrated, but I can't afford premium right now...
> Thanks.
>
> -eric ho
>
>


Re: Cassandra data modeling for a social network

2016-05-31 Thread vincent gromakowski
Or use GraphFrames (Spark) over Cassandra to store a graph of users and
followers separately from a table of tweets. You will be able to join
data between those two structures using Spark.
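
The shape of that join can be illustrated without Spark at all; here it is with plain Python lists standing in for the two Cassandra tables (the data and names are made up for illustration):

```python
# Toy stand-ins for the two structures kept separately:
# a follower "graph" and a tweets table.
followers = [("alice", "carol"), ("bob", "carol"),
             ("alice", "dave")]                      # (follower, followed)
tweets = [("carol", "t1", "hello"), ("dave", "t2", "hi"),
          ("erin", "t3", "hey")]                     # (author, id, body)


def timeline(user: str) -> list[tuple[str, str, str]]:
    """Join the follower graph with the tweets table for one user."""
    followed = {f for (who, f) in followers if who == user}
    return [t for t in tweets if t[0] in followed]


print(timeline("alice"))  # carol's and dave's tweets
```

In Spark the same join runs distributed (e.g. a DataFrame join between the two Cassandra-backed tables), which is what makes keeping the graph and the tweets in separate tables workable.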

2016-05-31 14:27 GMT+02:00 :

> Hello,
>
>   >* First, is this data modeling correct for a follow-based (follower,
> > following actions) social network?*
>
>
>
> For a social network, I advise you to look at graph databases over Cassandra.
>
>
>
> Example :
> https://academy.datastax.com/resources/getting-started-graph-databases
>
>
>
> *From:* Mohammad Kermani [mailto:98kerm...@gmail.com]
> *Sent:* Monday, May 30, 2016 13:42
> *To:* user@cassandra.apache.org
> *Subject:* Cassandra data modeling for a social network
>
>
>
> We are using Cassandra for our social network and we are designing/data
> modeling the tables we need. It is confusing for us: we don't know how to
> design some tables, and we have a few small problems!
>
>
>
> *As we understand it, for every query we have to have a different table*;
> for example, user A is following users C and B.
>
> Now, in Cassandra we have a table that is posts_by_user:
>
> user_id | post_id | text | created_on | deleted | view_count |
> likes_count | comments_count | user_full_name
>
> And we have a table keyed by the followers of users: we insert the post's
> info into a table called user_timeline, so that when follower users visit
> the first web page we fetch the posts from the user_timeline table.
>
> And here is user_timeline table:
>
> follower_id | post_id | user_id (who posted) | likes_count |
> comments_count | location_name | user_full_name
>
> *First, is this data modeling correct for a follow-based (follower,
> following actions) social network?*
>
> And now we want to count the likes of a post. As you see, we have the
> number of likes in both tables (*user_timeline*, *posts_by_user*), and
> imagine one user has 1000 followers: then for each like action we have to
> update all 1000 rows in user_timeline and 1 row in posts_by_user, which is
> not practical!
>
> *Then, my second question is: how should it be? I mean, how should the
> like (favorite) table be designed?*
>
>
>
> Thank you
>
> I hope I can get an answer
>
> _
>
> Ce message et ses pieces jointes peuvent contenir des informations 
> confidentielles ou privilegiees et ne doivent donc
> pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu 
> ce message par erreur, veuillez le signaler
> a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
> electroniques etant susceptibles d'alteration,
> Orange decline toute responsabilite si ce message a ete altere, deforme ou 
> falsifie. Merci.
>
> This message and its attachments may contain confidential or privileged 
> information that may be protected by law;
> they should not be distributed, used or copied without authorisation.
> If you have received this email in error, please notify the sender and delete 
> this message and its attachments.
> As emails may be altered, Orange is not liable for messages that have been 
> modified, changed or falsified.
> Thank you.
>
>
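
One common answer to the likes-count question above is to keep the counts in a dedicated counter table instead of denormalising them into every follower's timeline row. A sketch with hypothetical names, not taken from the thread:

```sql
-- Hypothetical schema for illustration only.
CREATE TABLE post_counters (
    post_id        timeuuid PRIMARY KEY,
    likes_count    counter,
    comments_count counter
);

-- One write per like, regardless of how many followers the author has:
UPDATE post_counters SET likes_count = likes_count + 1
WHERE post_id = ?;
```

The timeline tables then store no counts at all; the UI reads the counter row when it needs the numbers.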


Re: Data platform support

2016-05-10 Thread vincent gromakowski
Maybe a SMACK stack would be a better option for using spark with
Cassandra...
On 10 May 2016 at 8:45 AM, "Srini Sydney"  wrote:

> Thanks a lot..denise
>
> On 10 May 2016 at 02:42, Denise Rogers  wrote:
>
>> It really depends how close you want to stay to the most current versions
>> of open source community products.
>>
>> Cloudera has tended to build more products that require their
>> distribution, which tends not to be as current with open-source product
>> versions.
>>
>> Regards,
>> Denise
>>
>> Sent from my iPhone
>>
>> > On May 9, 2016, at 8:21 PM, Srini Sydney 
>> wrote:
>> >
>> > Hi guys
>> >
>> > We are thinking of using one of the 3 big data platforms, i.e.
>> Hortonworks, MapR or Cloudera. We will use Hadoop, Hive, ZooKeeper, and
>> Spark on these platforms.
>> >
>> >
>> > Which platform would be better suited for Cassandra?
>> >
>> >
>> > -  sreeni
>> >
>>
>>
>


Re: Efficiently filtering results directly in CS

2016-04-09 Thread vincent gromakowski
Spark over C* can push down lots of things (from basic filters or where
clauses to more advanced semi joins).

2016-04-09 3:54 GMT+02:00 kurt Greaves <k...@instaclustr.com>:

> If you're using C* 3.0 you can probably achieve this with UDFs.
> http://www.planetcassandra.org/blog/user-defined-functions-in-cassandra-3-0/
>
> On 9 April 2016 at 00:22, Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Ha..  Yes... C*...  I guess I need something like coprocessors in
>> bigtable.
>>
>> On Fri, Apr 8, 2016 at 1:49 AM, vincent gromakowski <
>> vincent.gromakow...@gmail.com> wrote:
>>
>>> c* I suppose
>>>
>>> 2016-04-07 19:30 GMT+02:00 Jonathan Haddad <j...@jonhaddad.com>:
>>>
>>>> What is CS?
>>>>
>>>> On Thu, Apr 7, 2016 at 10:03 AM Kevin Burton <bur...@spinn3r.com>
>>>> wrote:
>>>>
>>>>> I have a paging model whereby we stream data from CS by fetching
>>>>> 'pages' thereby reading (sequentially) entire datasets.
>>>>>
>>>>> We're using the bucket approach where we write data for 5 minutes,
>>>>> then we can just fetch the bucket for that range.
>>>>>
>>>>> Our app now has TONS of data and we have a piece of middleware that
>>>>> filters it based on the client requests.
>>>>>
>>>>> So if they only want english they just get english and filter away
>>>>> about 60% of our data.
>>>>>
>>>>> but it doesn't support condition pushdown.  So ALL this data has to be
>>>>> sent from our CS boxes to our middleware and filtered there (wasting a lot
>>>>> of network IO).
>>>>>
>>>>> Is there a way (including refactoring the code) that I could push this
>>>>> into CS?  Maybe some way I could discover the CS topology and put
>>>>> daemons on each of our CS boxes and fetch from CS directly (doing the
>>>>> filtering there).
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> --
>>>>>
>>>>> We’re hiring if you know of any awesome Java Devops or Linux
>>>>> Operations Engineers!
>>>>>
>>>>> Founder/CEO Spinn3r.com
>>>>> Location: *San Francisco, CA*
>>>>> blog: http://burtonator.wordpress.com
>>>>> … or check out my Google+ profile
>>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>>
>>>>>
>>>
>>
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> <https://plus.google.com/102718274791889610666/posts>
>>
>>
>
>
> --
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com
>
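
The 5-minute bucket scheme described in this thread can be made concrete with a small helper that maps timestamps to bucket keys (the partition keys to fetch). This is an illustrative sketch, not code from the thread:

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 5 * 60  # one partition ("bucket") per 5-minute window


def bucket_key(ts: datetime) -> int:
    """Start of ts's 5-minute bucket, as epoch seconds."""
    epoch = int(ts.timestamp())
    return epoch - (epoch % BUCKET_SECONDS)


def bucket_range(start: datetime, end: datetime) -> list[int]:
    """All bucket keys covering [start, end] -- the partitions to page over."""
    return list(range(bucket_key(start), bucket_key(end) + 1, BUCKET_SECONDS))


t = datetime(2016, 4, 7, 10, 3, 17, tzinfo=timezone.utc)
print(bucket_key(t))  # start of the 10:00-10:05 UTC window
```

Each bucket is a single partition, so fetching a time range means paging over a known, bounded list of partitions rather than scanning.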


Re: Efficiently filtering results directly in CS

2016-04-08 Thread vincent gromakowski
c* I suppose

2016-04-07 19:30 GMT+02:00 Jonathan Haddad :

> What is CS?
>
> On Thu, Apr 7, 2016 at 10:03 AM Kevin Burton  wrote:
>
>> I have a paging model whereby we stream data from CS by fetching 'pages'
>> thereby reading (sequentially) entire datasets.
>>
>> We're using the bucket approach where we write data for 5 minutes, then
>> we can just fetch the bucket for that range.
>>
>> Our app now has TONS of data and we have a piece of middleware that
>> filters it based on the client requests.
>>
>> So if they only want english they just get english and filter away about
>> 60% of our data.
>>
>> but it doesn't support condition pushdown.  So ALL this data has to be
>> sent from our CS boxes to our middleware and filtered there (wasting a lot
>> of network IO).
>>
>> Is there a way (including refactoring the code) that I could push this
>> into CS?  Maybe some way I could discover the CS topology and put
>> daemons on each of our CS boxes and fetch from CS directly (doing the
>> filtering there).
>>
>> Thoughts?
>>
>> --
>>
>> We’re hiring if you know of any awesome Java Devops or Linux Operations
>> Engineers!
>>
>> Founder/CEO Spinn3r.com
>> Location: *San Francisco, CA*
>> blog: http://burtonator.wordpress.com
>> … or check out my Google+ profile
>> 
>>
>>


Re: cassandra disks cache on SSD

2016-04-01 Thread vincent gromakowski
Can you provide an approximate estimate of the performance gain?

2016-04-01 19:27 GMT+02:00 Mateusz Korniak <mateusz-li...@ant.gliwice.pl>:

> On Friday 01 April 2016 13:16:53 vincent gromakowski wrote:
> > (...)  looking
> > for a way to use some kind of tiering with few SSD caching hot data from
> > HDD.
> > I have identified two solutions (...)
>
> We are using lvmcache for that.
> Regards,
> --
> Mateusz Korniak
> "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
> krótko mówiąc - podpora społeczeństwa."
> Nikos Kazantzakis - "Grek Zorba"
>
>
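
For readers who want to try the lvmcache route Mateusz mentions, the setup is roughly the following. The device names, sizes and volume-group layout below are placeholders; consult lvmcache(7) before running anything:

```shell
# Assumes an existing VG "vg0" whose data LV "vg0/data" sits on HDDs,
# plus an unused SSD at /dev/sdx (placeholder).
pvcreate /dev/sdx
vgextend vg0 /dev/sdx

# Carve cache-data and cache-metadata LVs out of the SSD:
lvcreate -L 90G -n cache0     vg0 /dev/sdx
lvcreate -L 1G  -n cache0meta vg0 /dev/sdx

# Combine them into a cache pool and attach it to the data LV:
lvconvert --type cache-pool --poolmetadata vg0/cache0meta vg0/cache0
lvconvert --type cache --cachepool vg0/cache0 vg0/data
```

The cache is transparent to Cassandra: the filesystem on vg0/data simply gets SSD-backed hot blocks.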


cassandra disks cache on SSD

2016-04-01 Thread vincent gromakowski
I am looking for a way to optimize large reads.
I have seen that using SSDs is a good option, but it is out of budget, so I
am looking for a way to use some kind of tiering, with a few SSDs caching
hot data from HDD.
I have identified two solutions and would like to get your opinions, and to
hear if you have any experience using them:
- use ZFS with its L2ARC functionality
- use the Rapiddisk/Rapidcache Linux kernel module
Any opinions? Constraints? Experience reports (REX)?
Thanks