Re: Adding New Nodes/Data Center to an existing Cluster.

2015-08-31 Thread Neha Trivedi
Hi,
Can you specify which version of Cassandra you are using?
Can you provide the error stack?

regards
Neha

On Tue, Sep 1, 2015 at 2:56 AM, Sebastian Estevez <sebastian.este...@datastax.com> wrote:

> or https://issues.apache.org/jira/browse/CASSANDRA-8611 perhaps
>
> All the best,
>
>
> Sebastián Estévez
>
> Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com
>
>
> On Mon, Aug 31, 2015 at 5:24 PM, Eric Evans  wrote:
>
>>
>> On Mon, Aug 31, 2015 at 1:32 PM, Sachin Nikam  wrote:
>>
>>> When we add 3 more nodes in Data Center B, the repair tool starts
>>> syncing the data between the two data centers and then gives up after ~2 days.
>>>
>>> Has anybody run into a similar issue before? If so, what is the solution?
>>>
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-9624, maybe?
>>
>>
>> --
>> Eric Evans
>> eev...@wikimedia.org
>>
>
>


Adding New Nodes/Data Center to an existing Cluster.

2015-08-31 Thread Sachin Nikam
Here is the situation.
We have 3 nodes in Data Center A with Replication Factor of 2.
We want to add 3 more nodes in Data Center B with Replication Factor of 2.
Each node in Data Center A has about 150GB of data.

When we add 3 more nodes in Data Center B, the repair tool starts syncing
the data between the two data centers and then gives up after ~2 days.

Has anybody run into a similar issue before? If so, what is the solution?
Regards
Sachin
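
For reference, the documented way to populate a new data center streams the
data with nodetool rebuild rather than repair. A minimal sketch of that
sequence, assuming NetworkTopologyStrategy and illustrative keyspace and DC
names (my_ks, DCA, DCB):

ALTER KEYSPACE my_ks WITH replication =
  {'class': 'NetworkTopologyStrategy', 'DCA': 2, 'DCB': 2};

Then, on each new node in Data Center B:

nodetool rebuild -- DCA

Repair afterwards only has to reconcile replicas; the bulk transfer is
rebuild's job, and that is the part throttled by stream throughput.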


Re: Adding New Nodes/Data Center to an existing Cluster.

2015-08-31 Thread Eric Evans
On Mon, Aug 31, 2015 at 1:32 PM, Sachin Nikam  wrote:

> When we add 3 more nodes in Data Center B, the repair tool starts syncing
> the data between the two data centers and then gives up after ~2 days.
>
> Has anybody run into a similar issue before? If so, what is the solution?
>

https://issues.apache.org/jira/browse/CASSANDRA-9624, maybe?


-- 
Eric Evans
eev...@wikimedia.org


Re: Adding New Nodes/Data Center to an existing Cluster.

2015-08-31 Thread Sebastian Estevez
or https://issues.apache.org/jira/browse/CASSANDRA-8611 perhaps

All the best,


Sebastián Estévez

Solutions Architect | 954 905 8615 | sebastian.este...@datastax.com

On Mon, Aug 31, 2015 at 5:24 PM, Eric Evans  wrote:

>
> On Mon, Aug 31, 2015 at 1:32 PM, Sachin Nikam  wrote:
>
>> When we add 3 more nodes in Data Center B, the repair tool starts syncing
>> the data between the two data centers and then gives up after ~2 days.
>>
>> Has anybody run into a similar issue before? If so, what is the solution?
>>
>
> https://issues.apache.org/jira/browse/CASSANDRA-9624, maybe?
>
>
> --
> Eric Evans
> eev...@wikimedia.org
>


future very wide row support

2015-08-31 Thread Dan Kinder
Hi,

My understanding is that wide-row support (i.e. many columns/CQL rows/cells
per partition key) has gotten much better in the past few years; even
though the theoretical limit of 2 billion has long been much higher than
what is practical, it seems like Cassandra is now able to handle these
better (e.g. incremental compaction so that Cassandra doesn't OOM).

So I'm wondering:

   - With more recent improvements (say, including up to 2.2 or maybe 3.0),
   is the practical limit still much lower than 2 billion? Do we have any idea
   what limits us in this regard? (Maybe repair is still another bottleneck?)
   - Is the 2 billion limit an SSTable limitation?
   https://issues.apache.org/jira/browse/CASSANDRA-7447 seems to indicate
   that it might be. Is there any future work we think will increase this
   limit?

A couple of caveats:

I am aware that even if such a large partition is possible, it may not
usually be practical, because it works against Cassandra's primary feature
of sharding data across multiple nodes and parallelizing access. However,
some analytics/batch-processing use cases could benefit from the guarantee
that a certain set of data sits together on one node. It can also make
certain data modeling situations a bit easier, where currently we just have
to model around the limitation. Also, 2 billion rows of small columns only
adds up to data in the tens of gigabytes, and the use of larger nodes these
days means that one node could practically hold much larger partitions.
And lastly, there are cases where 99.999% of partition keys are going to be
pretty small but there are potential outliers that could be very large; it
would be great for Cassandra to handle these even if it is suboptimal,
helping us all avoid having to model around such exceptions.
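
(To put a number on "tens of gigabytes": assuming an illustrative ~20 bytes
per cell, 2 billion cells come to roughly 2e9 × 20 B ≈ 40 GB, comfortably
within a single modern node's disk.)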

Well, this turned into something of an essay... thanks for reading, and I'd
be glad to receive input on this.


Re: Network / GC / Latency spike

2015-08-31 Thread Fabien Rousseau
Hi Alain,

Could it be wide rows + read repair? (Suppose read repair repairs the full
row; it may not be subject to the stream throughput limit.)
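
One way to test that hypothesis, as a sketch (the table name is
illustrative): turn off global, cross-DC read repair while keeping the
DC-local one, and see whether the bursts stop:

ALTER TABLE my_ks.wide_table
  WITH read_repair_chance = 0.0
  AND dclocal_read_repair_chance = 0.1;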

Best Regards
Fabien

2015-08-31 15:56 GMT+02:00 Alain RODRIGUEZ :

> I just realised that I have no idea how this mailing list handles
> attached files.
>
> Please find the screenshots here --> http://img42.com/collection/y2KxS
>
> Alain
>
> 2015-08-31 15:48 GMT+02:00 Alain RODRIGUEZ :
>
>> Hi,
>>
>> Running a 2.0.16 C* on AWS (private VPC, 2 DCs).
>>
>> I am facing an issue in our EU DC where I see a network burst (along with
>> a GC and latency increase).
>>
>> My first thought was a sudden application burst, though I see no
>> corresponding change in reads / writes or even CPU.
>>
>> So I thought that this might come from the nodes themselves, as IN
>> network almost equals OUT network. I tried lowering stream throughput on
>> the whole DC to 1 Mbps; with ~30 nodes --> 30 Mbps --> ~4 MB/s max. Yet my
>> network went a lot higher, about 30 M in both directions (see screenshots
>> attached).
>>
>> I have tried to use iftop to see where this traffic is headed, but I was
>> not able to because the bursts are very short.
>>
>> So, questions are:
>>
>> - Has anyone experienced something similar already? If so, any clue would
>> be appreciated :).
>> - How can I find out (monitor, capture) where this big amount of traffic
>> is headed, or what it is due to?
>> - Am I right to try to figure out what this traffic is, or should I
>> follow another lead?
>>
>> Note: I also noticed that CPU does not spike, nor does R, but disk
>> reads do spike!
>>
>> C*heers,
>>
>> Alain
>>
>
>
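
To catch bursts that are too short for interactive iftop, one option is a
rolling packet capture that can be inspected after the fact. A minimal
sketch, assuming the interface is eth0 and the default non-SSL ports (7000
inter-node, 9042 native protocol, 9160 Thrift):

# write a new 60-second capture file, stop after 30 files (~30 minutes)
sudo tcpdump -i eth0 -G 60 -W 30 -w '/tmp/burst-%s.pcap' \
    port 7000 or port 9042 or port 9160

Summing bytes per remote host in the file covering a burst (for example
with Wireshark's conversations view) should show where the traffic is
headed.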


Rebuild new DC nodes against new DC?

2015-08-31 Thread Bryan Cheng
Hi list,

We're bringing up a second DC, and following the procedure outlined here:
http://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_dc_to_cluster_t.html

We have three nodes in the new DC that are members of the cluster and
indicate that they are running normally. We have begun the process of
altering the keyspaces for multi-DC and are streaming over data via
nodetool rebuild on a keyspace-by-keyspace basis.

I couldn't find a clear answer for this: at what point is it safe to
rebuild from the new DC versus the old?

In other words, I have machines a, b, and c in DC2 (the new DC). I built a
and b by specifying DC1 on the rebuild command line. Can I safely rebuild
against DC2 for machine c? Is this at all dependent on quorum settings?
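
For reference, the two invocations in question, assuming the DCs are
literally named DC1 and DC2:

nodetool rebuild -- DC1    # what a and b ran: stream from the old DC
nodetool rebuild -- DC2    # the question: can c stream from the new DC?

One thing worth verifying first: c can only get a complete data set from
DC2 if a and b have finished their rebuilds and between them cover every
token range.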

Our DCs are linked by a VPN that doesn't have as big a pipe as we'd like;
streaming within the new DC would make things faster and ease some
headaches.

Thanks for any help!

--Bryan


RE: Cassandra 2.2 for time series

2015-08-31 Thread Pål Andreassen
Cassandra 2.2 has min and max built-in. My problem is getting the corresponding 
sample time as well.

Pål Andreassen
54°23'58"S 3°18'53"E
Konsulent
Mobil +47 982 85 504
pal.andreas...@bouvet.no

Bouvet Norge AS
Avdeling Grenland
Uniongata 18, Klosterøya
N-3732 Skien
Tlf +47 23 40 60 00
bouvet.no

From: Peter Lin [mailto:wool...@gmail.com]
Sent: Monday, 31 August 2015 16:09
To: user@cassandra.apache.org
Subject: Re: Cassandra 2.2 for time series


Unlike SQL, CQL doesn't have built-in functions like max/min.
In the past, people would create summary tables to keep rolling stats for
reports/analytics. In CQL3 there are user-defined functions, so you can
write a function to do max/min.

http://cassandra.apache.org/doc/cql3/CQL-2.2.html#selectStmt
http://cassandra.apache.org/doc/cql3/CQL-2.2.html#udfs
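
Along those lines, a sketch of a user-defined aggregate that keeps the
timestamp alongside the running max. Names and the keyspace ts are
illustrative; 2.2 requires enable_user_defined_functions: true in
cassandra.yaml; and the state packs the timestamp as epoch millis, using 0
as a "no sample seen yet" sentinel:

CREATE OR REPLACE FUNCTION ts.max_point_state(
    state tuple<bigint, double>, t timestamp, v double)
  CALLED ON NULL INPUT
  RETURNS tuple<bigint, double>
  LANGUAGE java
  AS $$
    if (t == null || v == null) return state;
    // first sample, or a new maximum: remember the value and when it occurred
    if (state.getLong(0) == 0 || v > state.getDouble(1)) {
      state.setLong(0, t.getTime());
      state.setDouble(1, v);
    }
    return state;
  $$;

CREATE OR REPLACE AGGREGATE ts.max_point(timestamp, double)
  SFUNC max_point_state
  STYPE tuple<bigint, double>
  INITCOND (0, 0);

-- usage: SELECT max_point(sampleTime, value) FROM DataRaw WHERE channelId = 42;
-- the first tuple field is the epoch millis of the max; convert it client-side.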

On Mon, Aug 31, 2015 at 9:48 AM, Pål Andreassen wrote:
Hi

I’m currently evaluating Cassandra as a potential database for storing time
series data from lots of devices (an IoT type of scenario).
Currently we have a few thousand devices with X channels (measurements) that
they report at different intervals (from 5 minutes and up).

I’ve created a simple test table to store the data:

CREATE TABLE DataRaw(
  channelId int,
  sampleTime timestamp,
  value double,
  PRIMARY KEY (channelId, sampleTime)
) WITH CLUSTERING ORDER BY (sampleTime ASC);

This schema seems to work OK, but there are queries I need to support that I
cannot easily figure out how to perform (except by getting all the data out
and iterating over it myself).

Query 1: For max and min queries, I want not only the maximum/minimum value
but also the corresponding timestamp.


sampleTime          value
2015-08-28 00:00       10
2015-08-28 01:00       15
2015-08-28 02:00       13

I'd like the max query to return both 2015-08-28 01:00 and 15. SELECT
sampleTime, max(value) FROM DataRaw returns the max value, but the
sampleTime of the first row.
I also wonder whether Cassandra has built-in support for
interpolation/extrapolation, or some sort of group-by hour/day/week/month
or even year function.

Query 2: Give me hourly averages for channel X for yesterday. I’d expect to
get 24 values, each of which is an hourly average. Or give me daily averages
for the last year for a given channel; that should return 365 daily
averages.

Best regards

Pål Andreassen
54°23'58"S 3°18'53"E
Konsulent
Mobil +47 982 85 504
pal.andreas...@bouvet.no

Bouvet Norge AS
Avdeling Grenland
Uniongata 18, Klosterøya
N-3732 Skien
Tlf +47 23 40 60 00
bouvet.no




Cassandra 2.2 for time series

2015-08-31 Thread Pål Andreassen
Hi

I'm currently evaluating Cassandra as a potential database for storing time
series data from lots of devices (an IoT type of scenario).
Currently we have a few thousand devices with X channels (measurements) that
they report at different intervals (from 5 minutes and up).

I've created a simple test table to store the data:

CREATE TABLE DataRaw(
  channelId int,
  sampleTime timestamp,
  value double,
  PRIMARY KEY (channelId, sampleTime)
) WITH CLUSTERING ORDER BY (sampleTime ASC);

This schema seems to work OK, but there are queries I need to support that I
cannot easily figure out how to perform (except by getting all the data out
and iterating over it myself).

Query 1: For max and min queries, I want not only the maximum/minimum value
but also the corresponding timestamp.


sampleTime          value
2015-08-28 00:00       10
2015-08-28 01:00       15
2015-08-28 02:00       13

I'd like the max query to return both 2015-08-28 01:00 and 15. SELECT
sampleTime, max(value) FROM DataRaw returns the max value, but the
sampleTime of the first row.
I also wonder whether Cassandra has built-in support for
interpolation/extrapolation, or some sort of group-by hour/day/week/month
or even year function.

Query 2: Give me hourly averages for channel X for yesterday. I'd expect to
get 24 values, each of which is an hourly average. Or give me daily averages
for the last year for a given channel; that should return 365 daily
averages.
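
CQL has no GROUP BY (as of 2.2), so the usual approach for Query 2 is one
range query per bucket, issued by the application. A sketch, assuming
channel 42 and 2.2's built-in avg():

SELECT avg(value) FROM DataRaw
WHERE channelId = 42
  AND sampleTime >= '2015-08-30 00:00:00'
  AND sampleTime <  '2015-08-30 01:00:00';

Advance the window hour by hour for the other 23 buckets; daily averages
for a year work the same way with 365 day-sized windows (or a summary table
maintained at write time, if that many queries is too chatty).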

Best regards

Pål Andreassen
54°23'58"S 3°18'53"E
Konsulent
Mobil +47 982 85 504
pal.andreas...@bouvet.no

Bouvet Norge AS
Avdeling Grenland
Uniongata 18, Klosterøya
N-3732 Skien
Tlf +47 23 40 60 00
bouvet.no