Re: cassandra disks cache on SSD

2016-04-01 Thread vincent gromakowski
Can you provide an approximate estimate of the performance gain?

2016-04-01 19:27 GMT+02:00 Mateusz Korniak :

> On Friday 01 April 2016 13:16:53 vincent gromakowski wrote:
> > (...)  looking
> > for a way to use some kind of tiering, with a few SSDs caching hot data from
> > HDD.
> > I have identified two solutions (...)
>
> We are using lvmcache for that.
> Regards,
> --
> Mateusz Korniak
> "(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
> krótko mówiąc - podpora społeczeństwa."
> Nikos Kazantzakis - "Grek Zorba"
>
>


Re: Adding Options to Create Statements...

2016-04-01 Thread James Carman
But, if there were a Java driver provided by the Apache Cassandra project
itself, then it'd be an easy choice.


On Fri, Apr 1, 2016 at 2:16 PM Robert Coli  wrote:

> On Fri, Apr 1, 2016 at 10:43 AM, James Carman 
> wrote:
>
>> Ah, my bad.  One might wonder why the heck the Java driver is "owned"
>> by an outside entity, eh?
>>
>
> FWIW, the status quo prior to the Datastax drivers was a wide assortment
> of non-compatible drivers in different languages, not one set of
> officially-supported-by-the-Apache-project ones..
>
> ... you probably would not have preferred it? :D
>
> =Rob
>
>


Re: Adding Options to Create Statements...

2016-04-01 Thread Robert Coli
On Fri, Apr 1, 2016 at 10:43 AM, James Carman 
wrote:

> Ah, my bad.  One might wonder why the heck the Java driver is "owned"
> by an outside entity, eh?
>

FWIW, the status quo prior to the Datastax drivers was a wide assortment of
non-compatible drivers in different languages, not one set of
officially-supported-by-the-Apache-project ones..

... you probably would not have preferred it? :D

=Rob


Re: Adding Options to Create Statements...

2016-04-01 Thread Jonathan Haddad
Because it's a community driver, not one provided by the Apache project.  There
have historically been other community-provided drivers as well; see
Hector, Astyanax, pycassa, etc.

On Fri, Apr 1, 2016 at 10:43 AM James Carman 
wrote:

> Ah, my bad.  One might wonder why the heck the Java driver is "owned"
> by an outside entity, eh?
>
> On Fri, Apr 1, 2016 at 11:58 AM Tyler Hobbs  wrote:
>
>> I'm not sure which driver you're referring to, but if it's the java
>> driver, it has its own mailing list that may be more helpful:
>> https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user
>>
>> On Thu, Mar 31, 2016 at 4:40 PM, James Carman  wrote:
>>
>>> No thoughts? Would an upgrade of the driver "fix" this?
>>>
>>> On Wed, Mar 30, 2016 at 10:42 AM James Carman <
>>> ja...@carmanconsulting.com> wrote:
>>>
 I am trying to perform the following operation:

 public Create createCreate() {
   Create create = SchemaBuilder.createTable("foo")
       .addPartitionColumn("bar", varchar())
       .addClusteringColumn("baz", varchar());
   if (descending) {
     create.withOptions().clusteringOrder("baz", Direction.DESC);
   }
   return create;
 }

 I don't want to have to return the Create.Options object from this
 method (as I may need to add other columns).  Is there a way to have the
 options "decorate" the Create directly without having to return the
 Create.Options?


>>
>>
>> --
>> Tyler Hobbs
>> DataStax 
>>
>


Re: Adding Options to Create Statements...

2016-04-01 Thread James Carman
Ah, my bad.  One might wonder why the heck the Java driver is "owned" by
an outside entity, eh?

On Fri, Apr 1, 2016 at 11:58 AM Tyler Hobbs  wrote:

> I'm not sure which driver you're referring to, but if it's the java
> driver, it has its own mailing list that may be more helpful:
> https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user
>
> On Thu, Mar 31, 2016 at 4:40 PM, James Carman 
> wrote:
>
>> No thoughts? Would an upgrade of the driver "fix" this?
>>
>> On Wed, Mar 30, 2016 at 10:42 AM James Carman 
>> wrote:
>>
>>> I am trying to perform the following operation:
>>>
>>> public Create createCreate() {
>>>   Create create = SchemaBuilder.createTable("foo")
>>>       .addPartitionColumn("bar", varchar())
>>>       .addClusteringColumn("baz", varchar());
>>>   if (descending) {
>>>     create.withOptions().clusteringOrder("baz", Direction.DESC);
>>>   }
>>>   return create;
>>> }
>>>
>>> I don't want to have to return the Create.Options object from this
>>> method (as I may need to add other columns).  Is there a way to have the
>>> options "decorate" the Create directly without having to return the
>>> Create.Options?
>>>
>>>
>
>
> --
> Tyler Hobbs
> DataStax 
>


Re: Multi DC setup for analytics

2016-04-01 Thread Laszlo Jobs
Anishek,

AFAIK you cannot have clusters "overlap" each other.

Just an idea: Try to address it as an sstable restore.
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html

What I would try to do (not tested!):

- create a logical DC in each cluster (CLUSTER_1 and CLUSTER_2) with a
limited number of nodes, so you do not need to back up a lot of nodes; let's
call it DC_AR (for Analytics Replica)

- alter the replication factor of the keyspaces and tables in CLUSTER_1 and
CLUSTER_2 so that they also store at least 1 replica in their own DC_AR DC
(this is a minimal change to CLUSTER_1 and CLUSTER_2)

- when restoring to CLUSTER_3, create snapshots in the DC_AR DC of each
cluster (CLUSTER_1 and CLUSTER_2)

- follow the restore procedure described in the link above and restore
sstables to the analytics cluster CLUSTER_3

- create the DC_AR DC in each cluster, CLUSTER_1 and CLUSTER_2, with one or
more nodes according to your needs

- if you need more power in CLUSTER_3, you can add more nodes after the
restore and run repair (this could be time consuming)

You might tune the process above, as this is just a high-level idea.
You need to consider the following things, among others (I am sure this is
not a complete list):
- maintain the schema of the CLUSTER_3 keyspaces whenever it changes on
CLUSTER_1 or CLUSTER_2
- you cannot use the same keyspace names on CLUSTER_1 and CLUSTER_2
- the replication factor for the DC_AR DCs in both clusters, CLUSTER_1 and
CLUSTER_2
- what consistency level you use in your application: QUORUM might hurt you,
but LOCAL_QUORUM could be OK.
- ensure that clients are not connecting to DC_AR nodes (not a hard
requirement)

If this works, then you do not have to rebuild the clusters you have today
(CLUSTER_1 and CLUSTER_2).

P.S. I am relatively new to Cassandra and using only 3.x versions (using =
playing and learning).

Regards,

Laszlo


On Wed, Mar 30, 2016 at 8:43 AM, Anishek Agarwal  wrote:

> Hey Guys,
>
> We did the necessary changes and were trying to get this back on track,
> but hit another wall,
>
> we have two Clusters in Different DC ( DC1 and DC2) with cluster names (
> CLUSTER_1, CLUSTER_2)
>
> we want to have a common analytics cluster in DC3 with cluster name
> (CLUSTER_3). -- looks like this can't be done, so we have to set up two
> different analytics clusters? Can't we just get data from CLUSTER_1/2 into
> the same cluster, CLUSTER_3?
>
> thanks
> anishek
>
> On Mon, Mar 21, 2016 at 3:31 PM, Anishek Agarwal 
> wrote:
>
>> Hey Clint,
>>
>> we have two separate rings which don't talk to each other, but both have
>> the same DC name "DCX".
>>
>> @Raja,
>>
>> We had already gone towards the path you suggested.
>>
>> thanks all
>> anishek
>>
>> On Fri, Mar 18, 2016 at 8:01 AM, Reddy Raja  wrote:
>>
>>> Yes. Here are the steps.
>>> You will have to change the DC Names first.
>>> DC1 and DC2 would be independent clusters.
>>>
>>> Create a new DC, DC3, and include these two DCs in DC3.
>>>
>>> This should work well.
>>>
>>>
>>> On Thu, Mar 17, 2016 at 11:03 PM, Clint Martin <
>>> clintlmar...@coolfiretechnologies.com> wrote:
>>>
 When you say you have two logical DCs both with the same name, are you
 saying that you have two clusters of servers, both with the same DC name,
 neither of which currently talks to the other? I.e., they are two separate
 rings?

 Or do you mean that you have two keyspaces in one cluster?

 Or?

 Clint
 On Mar 14, 2016 2:11 AM, "Anishek Agarwal"  wrote:

> Hello,
>
> We are using Cassandra 2.0.17 and have two logical DCs with different
> keyspaces, but both have the same logical name, DC1.
>
> we want to set up another Cassandra cluster for analytics which should
> get data from both of the above DCs.
>
> if we setup the new DC with name DC2 and follow the steps
> https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_add_dc_to_cluster_t.html
> will it work ?
>
> I would think we would have to first change the names of the existing
> clusters so that they have different names, and then go on with adding
> another DC getting data from these?
>
> Also, as soon as we add the node the data starts moving... this will
> all be only the real-time changes done to the cluster, right? We still have
> to do the rebuild to get the data for the tokens of the node in the new cluster?
>
> Thanks
> Anishek
>

>>>
>>>
>>> --
>>> "In this world, you either have an excuse or a story. I preferred to
>>> have a story"
>>>
>>
>>
>


Re: cassandra disks cache on SSD

2016-04-01 Thread Mateusz Korniak
On Friday 01 April 2016 13:16:53 vincent gromakowski wrote:
> (...)  looking
> for a way to use some kind of tiering, with a few SSDs caching hot data from
> HDD.
> I have identified two solutions (...)

We are using lvmcache for that.
Regards,
-- 
Mateusz Korniak
"(...) mam brata - poważny, domator, liczykrupa, hipokryta, pobożniś,
krótko mówiąc - podpora społeczeństwa."
Nikos Kazantzakis - "Grek Zorba"



Re: Adding Options to Create Statements...

2016-04-01 Thread Tyler Hobbs
I'm not sure which driver you're referring to, but if it's the java driver,
it has its own mailing list that may be more helpful:
https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user

On Thu, Mar 31, 2016 at 4:40 PM, James Carman 
wrote:

> No thoughts? Would an upgrade of the driver "fix" this?
>
> On Wed, Mar 30, 2016 at 10:42 AM James Carman 
> wrote:
>
>> I am trying to perform the following operation:
>>
>> public Create createCreate() {
>>   Create create = SchemaBuilder.createTable("foo")
>>       .addPartitionColumn("bar", varchar())
>>       .addClusteringColumn("baz", varchar());
>>   if (descending) {
>>     create.withOptions().clusteringOrder("baz", Direction.DESC);
>>   }
>>   return create;
>> }
>>
>> I don't want to have to return the Create.Options object from this method
>> (as I may need to add other columns).  Is there a way to have the options
>> "decorate" the Create directly without having to return the Create.Options?
>>
>>


-- 
Tyler Hobbs
DataStax 


cassandra disks cache on SSD

2016-04-01 Thread vincent gromakowski
I am looking for a way to optimize large reads.
I have seen that using SSDs is a good option but it is out of budget, so I am
looking for a way to use some kind of tiering, with a few SSDs caching hot
data from HDD.
I have identified two solutions and would like to get your opinions and hear
whether you have any experience using them:
- use ZFS with L2ARC functionality
- use Rapiddisk/Rapidcache Linux kernel module
Any opinions? Constraints? Feedback from experience (REX)?
Thanks


Cassandra sstable to Mysql

2016-04-01 Thread Abhishek Aggarwal
Hi ,

We have a data dump in a directory, created from MySQL data using
CQLSSTableWriter.

Our requirement is to read this data back and load it into MySQL. We don't
want to use Cassandra, as that would generate read traffic and this operation
is just for some validation.

Can anyone help us with a solution?
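
One possible path (a sketch only, not tested and not from this thread): dump
the sstables to text offline with the sstable2json tool bundled with
Cassandra 2.x (or its successor sstabledump in newer releases), flatten that
output to one row per line, and then bulk-insert into MySQL over JDBC. The
file name, JDBC URL, table and columns below are hypothetical placeholders:

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class SstableDumpToMysql {
        public static void main(String[] args) throws Exception {
            // Hypothetical input: one row per line, tab-separated (id, value),
            // produced beforehand by flattening sstable2json/sstabledump output.
            List<String> rows = Files.readAllLines(Paths.get("rows.tsv"));

            try (Connection conn = DriverManager.getConnection(
                         "jdbc:mysql://localhost:3306/validation", "user", "password");
                 PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO validation_table (id, value) VALUES (?, ?)")) {
                conn.setAutoCommit(false);          // single commit at the end
                int pending = 0;
                for (String row : rows) {
                    String[] cols = row.split("\t", -1);
                    ps.setString(1, cols[0]);
                    ps.setString(2, cols[1]);
                    ps.addBatch();
                    if (++pending % 1000 == 0) {
                        ps.executeBatch();          // flush every 1000 rows
                    }
                }
                ps.executeBatch();                  // flush the remainder
                conn.commit();
            }
        }
    }

Since this is only for validation, reading the whole flattened file into
memory is fine for modest dumps; for very large ones you would stream line by
line instead.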

Abhishek Aggarwal

*Senior Software Engineer*
*M*: +91 8861212073 , 8588840304
*T*: 0124 6600600 *EXT*: 12128
ASF Center -A, ASF Center Udyog Vihar Phase IV,



Re: Cassandra Resource Planning

2016-04-01 Thread Alain RODRIGUEZ
Hi Joe,


> I am doing resource planning and could use some help.


I worked alone for 4 years operating a growing cluster (from 3 to
60+ nodes, from t1.micro to i2.2xlarge AWS instances on the biggest cluster),
plus 2 other clusters, and handling MySQL too :'(. I have now joined a team
of Cassandra experts, so I have worked in both extreme situations.

So, first thing: it is doable, one person alone can do this. And I probably
could have handled more nodes (plus I was doing MySQL schema management for
new features).
Second thing: it is a real PITA for one person to be operating a production
Cassandra cluster on their own, really.

I would say that having one person working alone is a bad idea. First, because
when someone is digging into a Cassandra issue, an external point of view is
often very enlightening. It is a complex system and discussing possible
solutions is often worth it. Here the community can help (here, IRC, ...).
Also, any time your operations person is out, who will handle operations? What
if your operator leaves your company? This can happen, as a lot of people
want to recruit a good Cassandra operator.
My point is: using a Replication Factor of 2 or 3 for data (with the extra
cost it induces) but of 1 for people will produce a 'sort of Single Point of
Failure' around Cassandra. Cassandra often needs to be 100% up (or close to
it); what happens if Cassandra starts failing during your operator's
3-week-long holidays?

In the past I worked during nights, holidays, Christmas, ... If there had been
2 of us, I would have done (roughly) half of the work, and not during my time
off. I might then have stayed longer at my previous company.
Be careful: having only one operator on a Cassandra cluster will probably
exhaust him quite quickly.

From my own experience, I would say you should probably have a second
person as soon as you can (they can do something other than Cassandra half
of the time at the start if needed). But I truly believe 2 people knowing and
able to act on Cassandra is a good number to reach asap. If you don't want to
do that, at least make sure to have some other people in your team able to do
first-level support (restart nodes, monitor, roughly understand how Cassandra
works, apply commands given by the operator - have him prepare common
troubleshooting steps) and/or consider using external support when your
operator is blocked, as he probably won't be able to answer everything on
his own.

Data is the beating heart of many businesses, and it is still often
under-provisioned (machines and people) as it is a cost with no direct
income. Think about how important data availability / consistency / latency
are to keep in a good state in your case, and act accordingly :-).

Then, when there is a team of 2, you don't need to scale according to the
number of nodes (the Netflix C* team used to be 2 or 3 people and they had
1000+ servers, if I remember correctly). The whole thing is making sure
operators can, and are encouraged to, automate common actions and script
things as much as they find useful. Then a few people can handle a lot of
nodes; it is far from linearly related to the number of nodes.

> How many operations people will I need to manage my Cassandra
> implementation for two sites with 10 nodes at each site? As my cluster
> grows, at what point will I need to add another person?


I would finally say that the number of operators needed might actually be
more related to the number of devs / tech team members you have. My team had
60 devs, and me alone operating Cassandra, which I believe is ridiculous
and I don't recommend. More devs = more features, more modeling work, more
services hitting the database, etc. It also depends on the management /
automation systems in place: basically, if adding a node is a 5-minute
operation for this person or a 2-hour operation (just to get the node
prepared), you obviously don't need the same number of people there.

FWIW, here is a post I wrote that I believe might help your operator
handle your small cluster:
http://thelastpickle.com/blog/2016/03/21/running-commands-cluster-wide.html

These are only personal thoughts and considerations drawn from my own
experience. Others might have other considerations, or see things from a
different perspective than the Cassandra operator's (which is mine here). I
hope you will be kind to your operator and find him a friend to talk with!
I think it is better both for the company and for him.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-04-01 7:41 GMT+02:00 Joe Hicks :

> I am doing resource planning and could use some help. How many operations
> people will I need to manage my Cassandra implementation for two sites with
> 10 nodes at each site? As my cluster grows, at what point will I need to
> add another person?
>


Re: Speeding up "nodetool rebuild"

2016-04-01 Thread Alain RODRIGUEZ
Hi,

is there any way to determine that rebuild is complete


If you ran it from a screen (
https://www.gnu.org/software/screen/manual/screen.html) or similar stuff,
you should see the command return.

Also, 'nodetool netstats | grep -v 100%' will show you the remaining streams.
No streams = rebuild finished (look for possible errors in the logs though...).

A last tip: you should be able to estimate how big the dataset is going to
be, and checking the on-disk size gives good progress information too. This
is not really accurate though.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2016-03-31 23:19 GMT+02:00 Anubhav Kale :

> Thanks, is there any way to determine that rebuild is complete?
>
> Based on the following code in StorageService.java, it's not logged. So, is
> there any other way to check besides checking the data size through nodetool status?
>
> finally
> {
> // rebuild is done (successfully or not)
> isRebuilding.set(false);
> }
>
>
> -Original Message-
> From: Eric Evans [mailto:eev...@wikimedia.org]
> Sent: Thursday, March 31, 2016 9:50 AM
> To: user@cassandra.apache.org
> Subject: Re: Speeding up "nodetool rebuild"
>
> On Wed, Mar 30, 2016 at 3:44 PM, Anubhav Kale 
> wrote:
> > Any other ways to make the “rebuild” faster ?
>
> TL;DR add more nodes
>
> If you're encountering a per-stream bottleneck (easy to do if using
> compression), then having a higher node count will translate to higher
> stream concurrency, and greater throughput.
>
> Another thing to keep in mind: the streamthroughput value is *outbound*;
> it doesn't matter what you have that set to on the rebuilding/bootstrapping
> node, it *does* matter what it is set to on the nodes that are sending to
> it (https://issues.apache.org/jira/browse/CASSANDRA-11303
> aims to introduce an inbound tunable though).
>
>
> --
> Eric Evans
> eev...@wikimedia.org
>


Re: NTP Synchronization Setup Changes

2016-04-01 Thread Brice Dutheil
Hi, another tip: make sure the OS doesn't come with pre-configured NTP
synchronisation services. We had a proper NTP setup, but we missed a service
that came with CentOS that synced to a low-stratum NTP server.
-- Brice

On Thu, Mar 31, 2016 at 10:00 AM -0700, "Eric Evans"  
wrote:

On Wed, Mar 30, 2016 at 8:07 PM, Mukil Kesavan
 wrote:
> Are there any issues if this causes a huge time correction on the cassandra
> cluster? I know that NTP gradually corrects the time on all the servers. I
> just wanted to understand if there were any corner cases that will cause us
> to lose data/schema updates when this happens. In particular, we seem to be
> having some issues around missing secondary indices at the moment (not all
> but some).

As a thought experiment, imagine every scenario where it matters to
have one write occur after another (an update followed by a delete is
a good example).  Now imagine having your clock yanked backward to
correct for drift between the first such operation and the second.
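
To make that concrete, here is a minimal sketch (assuming the DataStax 3.x
Java driver; the contact point, keyspace and table are hypothetical) that
uses explicit client-side timestamps to stand in for a clock that jumped
backward between the two operations:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class ClockSkewExample {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build()) {
                Session session = cluster.connect("demo_ks");
                // An update written at timestamp 2000 (normally "now", in microseconds).
                session.execute("UPDATE t USING TIMESTAMP 2000 SET v = 'new' WHERE k = 1");
                // A delete issued afterwards, but from a clock that was yanked backward,
                // carries a *smaller* timestamp...
                session.execute("DELETE FROM t USING TIMESTAMP 1000 WHERE k = 1");
                // ...so last-write-wins resolution keeps the update: the row survives,
                // even though the delete happened second in wall-clock order.
                System.out.println(session.execute("SELECT v FROM t WHERE k = 1").one());
            }
        }
    }

The same reordering can happen silently with ordinary server- or
client-generated timestamps whenever a node's clock is stepped backward.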

I would strongly recommend you come up with a stable NTP setup.


-- 
Eric Evans
eev...@wikimedia.org