RE: Slender Cassandra Cluster Project

2018-01-30 Thread Kenneth Brotman
Hi Yuri,

If possible I will do everything with AWS CloudFormation.  I'm working on it 
now.  Nothing published yet.

Kenneth Brotman

-Original Message-
From: Yuri Subach [mailto:ysub...@gmail.com] 
Sent: Tuesday, January 30, 2018 7:02 PM
To: user@cassandra.apache.org
Subject: RE: Slender Cassandra Cluster Project

Hi Kenneth,

I like this project idea!

A couple of questions:
- What tools are you going to use for AWS cluster setup?
- Do you have anything published already (github)?

On 2018-01-22 22:42:11, Kenneth Brotman  wrote:
> Thanks Anthony!  I’ve made a note to include that information in the 
> documentation. You’re right.  It won’t work as intended unless that is 
> configured properly.
> 
>  
> 
> I’m also favoring a couple other guidelines for Slender Cassandra:
> 
> 1.   SSD’s only, no spinning disks
> 
> 2.   At least two cores per node
> 
>  
> 
> For AWS, I’m favoring the c3.large on Linux.  It’s available in these 
> regions: US-East, US-West and US-West2.  The specifications are listed as:
> 
> · Two (2) vCPU’s
> 
> · 3.75 GiB Memory
> 
> · Two (2) 16 GB SSD’s
> 
> · Moderate I/O
> 
>  
> 
> It’s going to be hard to beat the inexpensive cost of operating a Slender 
> Cluster on demand in the cloud – and it fits a lot of the use cases well:  
> 
>  
> 
> · For under $100 a month, in current pricing for EC2 instances, you 
> can operate an eighteen (18) node Slender Cluster for five (5) hours a day, 
> ten (10) days a month.  That’s fine for demonstrations, teaching or 
> experiments that last half a day or less. 
> 
> · For under $20, you can have that Slender Cluster up all day long, 
> up to ten (10) hours, for whatever demonstrations or experiments you want it 
> for.
> 
>  
> 
> As always, feedback is encouraged.
> 
>  
> 
> Thanks,
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Anthony Grasso [mailto:anthony.gra...@gmail.com] 
> Sent: Sunday, January 21, 2018 3:57 PM
> To: user
> Subject: Re: Slender Cassandra Cluster Project
> 
>  
> 
> Hi Kenneth,
> 
>  
> 
> Fantastic idea!
> 
>  
> 
> One thing that came to mind from my reading of the proposed setup was rack 
> awareness of each node. Given that the proposed setup contains three DCs, I 
> assume that each node will be made rack aware? If not, consider defining 
> three racks for each DC and placing two nodes in each rack. This will ensure 
> that all the nodes in a single rack contain at most one replica of the data.
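For anyone following along, the rack and DC names come from the snitch (e.g.
GossipingPropertyFileSnitch plus cassandra-rackdc.properties on each node); the
keyspace itself only declares a replica count per DC. A rough sketch with the
Java driver (3.x assumed; the contact point and DC names dc1/dc2/dc3 are
placeholders), where RF=3 over three racks of two nodes leaves at most one
replica per rack:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateSlenderKeyspace {
    public static void main(String[] args) {
        // Connect to any node of the Slender cluster (placeholder address).
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // RF=3 per DC; with 3 racks of 2 nodes per DC, NetworkTopologyStrategy
            // spreads the 3 replicas across distinct racks.
            session.execute("CREATE KEYSPACE IF NOT EXISTS slender_demo WITH replication = {"
                          + " 'class': 'NetworkTopologyStrategy',"
                          + " 'dc1': 3, 'dc2': 3, 'dc3': 3 }");
        }
    }
}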
> 
>  
> 
> Regards,
> 
> Anthony
> 
>  
> 
> On 17 January 2018 at 11:24, Kenneth Brotman  
> wrote:
> 
> Sure.  That takes the project from awesome to 10X awesome.  I absolutely 
> would be willing to do that.  Thanks Kurt!
> 
>  
> 
> Regarding your comment on the keyspaces, I agree.  There should be a few 
> simple examples one way or the other that can be duplicated and observed, and 
> then an example to duplicate and play with that has a nice real world mix, 
> with some keyspaces that replicate over only a subset of DC’s and some that 
> replicate to all DC’s.
> 
>  
> 
> Kenneth Brotman 
> 
>  
> 
> From: kurt greaves [mailto:k...@instaclustr.com] 
> Sent: Tuesday, January 16, 2018 1:31 PM
> To: User
> Subject: Re: Slender Cassandra Cluster Project
> 
>  
> 
> Sounds like a great idea. Probably would be valuable to add to the official 
> docs as an example set up if you're willing.
> 
>  
> 
> Only thing I'd add is that you should have keyspaces that replicate over only 
> a subset of DC's, plus one/some replicated to all DC's
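A rough sketch of what that could look like, executed via the Java driver
(keyspace and DC names are placeholders; the subset keyspace simply omits the
other DCs from its replication map):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateSubsetKeyspaces {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // Replicated to every DC.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo_everywhere WITH replication = {"
                          + " 'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3 }");
            // Replicated only to a subset of the DCs (here just the analytics DC).
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo_analytics_only WITH replication = {"
                          + " 'class': 'NetworkTopologyStrategy', 'dc3': 3 }");
        }
    }
}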
> 
>  
> 
> On 17 Jan. 2018 03:26, "Kenneth Brotman"  wrote:
> 
> I’ve begun working on a reference project intended to provide guidance on 
> configuring and operating a modest Cassandra cluster of about 18 nodes 
> suitable for the economic study, demonstration, experimentation and testing 
> of a Cassandra cluster.
> 
>  
> 
> The slender cluster would be designed to be as inexpensive as possible while 
> still using real world hardware in order to lower the cost to those with 
> limited initial resources. Sorry no Raspberry Pi’s for this project.  
> 
>  
> 
> There would be an on-premises version and a cloud version.  Guidance would be 
> provided on configuring the cluster, on demonstrating key Cassandra 
> behaviors, on file sizes, capacity to use with the Slender Cassandra 
> Cluster, and so on.
> 
>  
> 
> Why about eighteen nodes? I tried to figure out the minimum number of 
> nodes needed for Cassandra to be Cassandra.  Here were my considerations:
> 
>  
> 
> • A user wouldn’t run Cassandra in just one data center; so at 
> least two datacenters.
> 
> • A user probably would want a third data center available for 
> analytics.
> 
> • There needs to be enough nodes for enough parallelism to 
> observe Cassandra’s distributed nature.
> 
> • The cluster should have 

Re: Nodes show different number of tokens than initially

2018-01-30 Thread Dikang Gu
What's the partitioner you use? We have logic to prevent duplicate tokens.

private static Collection<Token> adjustForCrossDatacenterClashes(final TokenMetadata tokenMetadata,
                                                                 StrategyAdapter strategy,
                                                                 Collection<Token> tokens)
{
    List<Token> filtered = Lists.newArrayListWithCapacity(tokens.size());

    for (Token t : tokens)
    {
        // If the token is already owned by another node inside the allocation ring this is a
        // configuration error; otherwise nudge the token until it no longer clashes.
        while (tokenMetadata.getEndpoint(t) != null)
        {
            InetAddress other = tokenMetadata.getEndpoint(t);
            if (strategy.inAllocationRing(other))
                throw new ConfigurationException(String.format(
                        "Allocated token %s already assigned to node %s. Is another node also allocating tokens?",
                        t, other));
            t = t.increaseSlightly();
        }
        filtered.add(t);
    }
    return filtered;
}
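For reference, manually assigned tokens are usually spread evenly across the
Murmur3 range starting at the partitioner minimum, so the first token of any
ring built that way is always -9223372036854775808 - which is exactly the first
collision in the logs quoted below. A small sketch of that arithmetic (the
3-tokens-per-node count is just an assumption):

import java.math.BigInteger;

public class EvenlySpacedTokens {
    public static void main(String[] args) {
        int numTokens = 3; // assumed number of manually assigned tokens per node
        BigInteger ringSize = BigInteger.valueOf(2).pow(64);        // size of the Murmur3 token space
        BigInteger start = BigInteger.valueOf(2).pow(63).negate();  // partitioner minimum, -2^63
        for (int i = 0; i < numTokens; i++) {
            // token_i = -2^63 + i * (2^64 / numTokens); i = 0 always yields -9223372036854775808
            BigInteger token = start.add(ringSize.multiply(BigInteger.valueOf(i))
                                                 .divide(BigInteger.valueOf(numTokens)));
            System.out.println(token);
        }
    }
}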



On Tue, Jan 30, 2018 at 8:44 AM, Jeff Jirsa  wrote:

> All DCs in a cluster use the same token space in the DHT, so token
> conflicts across datacenters are invalid config
>
>
> --
> Jeff Jirsa
>
>
> On Jan 29, 2018, at 11:50 PM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
> On Tue, Jan 30, 2018 at 5:13 AM, kurt greaves 
> wrote:
>
>> Shouldn't happen. Can you send through nodetool ring output from one of
>> those nodes? Also, did the logs have anything to say about tokens when you
>> started the 3 seed nodes?​
>>
>
> Hi Kurt,
>
> I cannot run nodetool ring anymore, since these test nodes are long gone.
> However I've grepped the logs and this is what I've found:
>
> Jan 25 08:57:18 ip-172-31-128-41 docker/cf3ea463915a[854]: INFO 08:57:18
> Nodes /172.31.128.31 and /172.31.128.41 have the same token
> -9223372036854775808. Ignoring /172.31.128.31
> Jan 25 08:57:18 ip-172-31-128-41 docker/cf3ea463915a[854]: INFO 08:57:18
> Nodes /172.31.144.32 and /172.31.128.41 have the same token
> -8454757700450211158. Ignoring /172.31.144.32
> Jan 25 08:58:30 ip-172-31-144-41 docker/48fba443d99f[852]: INFO 08:58:30
> Nodes /172.31.128.41 and /172.31.128.31 have the same token
> -9223372036854775808. /172.31.128.41 is the new owner
> Jan 25 08:58:30 ip-172-31-144-41 docker/48fba443d99f[852]: INFO 08:58:30
> Nodes /172.31.144.32 and /172.31.128.41 have the same token
> -8454757700450211158. Ignoring /172.31.144.32
> Jan 25 08:59:45 ip-172-31-160-41 docker/cced70e132f2[849]: INFO 08:59:45
> Nodes /172.31.128.41 and /172.31.128.31 have the same token
> -9223372036854775808. /172.31.128.41 is the new owner
> Jan 25 08:59:45 ip-172-31-160-41 docker/cced70e132f2[849]: INFO 08:59:45
> Nodes /172.31.144.32 and /172.31.128.41 have the same token
> -8454757700450211158. Ignoring /172.31.144.32
>
> Since we are allocating the tokens for seed nodes manually, it appears
> that the first seed node in the new ring (172.31.128.41) gets the same
> first token (-9223372036854775808) as the node in the old ring
> (172.31.128.31).  The same goes for the 3rd token of the new seed node
> (-8454757700450211158).
>
> What is beyond me is why would that matter and why would token ownership
> change at all, while these nodes are in the *different virtual DCs*?  To me
> this sounds like a particularly nasty bug...
>
> --
> Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176
> 127-59-707 <+49%20176%2012759707>
>
>


-- 
Dikang


RE: Slender Cassandra Cluster Project

2018-01-30 Thread Yuri Subach
Hi Kenneth,

I like this project idea!

A couple of questions:
- What tools are you going to use for AWS cluster setup?
- Do you have anything published already (github)?

On 2018-01-22 22:42:11, Kenneth Brotman  wrote:
> Thanks Anthony!  I’ve made a note to include that information in the 
> documentation. You’re right.  It won’t work as intended unless that is 
> configured properly.
> 
>  
> 
> I’m also favoring a couple other guidelines for Slender Cassandra:
> 
> 1.   SSD’s only, no spinning disks
> 
> 2.   At least two cores per node
> 
>  
> 
> For AWS, I’m favoring the c3.large on Linux.  It’s available in these 
> regions: US-East, US-West and US-West2.  The specifications are listed as:
> 
> · Two (2) vCPU’s
> 
> · 3.75 GiB Memory
> 
> · Two (2) 16 GB SSD’s
> 
> · Moderate I/O
> 
>  
> 
> It’s going to be hard to beat the inexpensive cost of operating a Slender 
> Cluster on demand in the cloud – and it fits a lot of the use cases well:  
> 
>  
> 
> · For under $100 a month, in current pricing for EC2 instances, you 
> can operate an eighteen (18) node Slender Cluster for five (5) hours a day, 
> ten (10) days a month.  That’s fine for demonstrations, teaching or 
> experiments that last half a day or less. 
> 
> · For under $20, you can have that Slender Cluster up all day long, 
> up to ten (10) hours, for whatever demonstrations or experiments you want it 
> for.
> 
>  
> 
> As always, feedback is encouraged.
> 
>  
> 
> Thanks,
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Anthony Grasso [mailto:anthony.gra...@gmail.com] 
> Sent: Sunday, January 21, 2018 3:57 PM
> To: user
> Subject: Re: Slender Cassandra Cluster Project
> 
>  
> 
> Hi Kenneth,
> 
>  
> 
> Fantastic idea!
> 
>  
> 
> One thing that came to mind from my reading of the proposed setup was rack 
> awareness of each node. Given that the proposed setup contains three DCs, I 
> assume that each node will be made rack aware? If not, consider defining 
> three racks for each DC and placing two nodes in each rack. This will ensure 
> that all the nodes in a single rack contain at most one replica of the data.
> 
>  
> 
> Regards,
> 
> Anthony
> 
>  
> 
> On 17 January 2018 at 11:24, Kenneth Brotman  
> wrote:
> 
> Sure.  That takes the project from awesome to 10X awesome.  I absolutely 
> would be willing to do that.  Thanks Kurt!
> 
>  
> 
> Regarding your comment on the keyspaces, I agree.  There should be a few 
> simple examples one way or the other that can be duplicated and observed, and 
> then an example to duplicate and play with that has a nice real world mix, 
> with some keyspaces that replicate over only a subset of DC’s and some that 
> replicate to all DC’s.
> 
>  
> 
> Kenneth Brotman 
> 
>  
> 
> From: kurt greaves [mailto:k...@instaclustr.com] 
> Sent: Tuesday, January 16, 2018 1:31 PM
> To: User
> Subject: Re: Slender Cassandra Cluster Project
> 
>  
> 
> Sounds like a great idea. Probably would be valuable to add to the official 
> docs as an example set up if you're willing.
> 
>  
> 
> Only thing I'd add is that you should have keyspaces that replicate over only 
> a subset of DC's, plus one/some replicated to all DC's
> 
>  
> 
> On 17 Jan. 2018 03:26, "Kenneth Brotman"  wrote:
> 
> I’ve begun working on a reference project intended to provide guidance on 
> configuring and operating a modest Cassandra cluster of about 18 nodes 
> suitable for the economic study, demonstration, experimentation and testing 
> of a Cassandra cluster.
> 
>  
> 
> The slender cluster would be designed to be as inexpensive as possible while 
> still using real world hardware in order to lower the cost to those with 
> limited initial resources. Sorry no Raspberry Pi’s for this project.  
> 
>  
> 
> There would be an on-premises version and a cloud version.  Guidance would be 
> provided on configuring the cluster, on demonstrating key Cassandra 
> behaviors, on file sizes, capacity to use with the Slender Cassandra 
> Cluster, and so on.
> 
>  
> 
> Why about eighteen nodes? I tried to figure out the minimum number of 
> nodes needed for Cassandra to be Cassandra.  Here were my considerations:
> 
>  
> 
> • A user wouldn’t run Cassandra in just one data center; so at 
> least two datacenters.
> 
> • A user probably would want a third data center available for 
> analytics.
> 
> • There needs to be enough nodes for enough parallelism to 
> observe Cassandra’s distributed nature.
> 
> • The cluster should have enough nodes that one gets a sense of 
> the need for cluster wide management tools to do things like repairs, 
> snapshots and cluster monitoring.
> 
> • The cluster should be able to demonstrate a RF=3 with local 
> quorum.  If replicated in all three data centers, one write would impact half 
> the 18 nodes, 3 

Re: CDC usability and future development

2018-01-30 Thread Jeff Jirsa
Here's a deck of some proposed additions, discussed at one of the NGCC
sessions last fall:

https://github.com/ngcc/ngcc2017/blob/master/CassandraDataIngestion.pdf



On Tue, Jan 30, 2018 at 5:10 PM, Andrew Prudhomme  wrote:

> Hi all,
>
> We are currently designing a system that allows our Cassandra clusters to
> produce a stream of data updates. Naturally, we have been evaluating if CDC
> can aid in this endeavor. We have found several challenges in using CDC for
> this purpose.
>
> CDC provides only the mutation as opposed to the full column value, which
> tends to be of limited use for us. Applications might want to know the full
> column value, without having to issue a read back. We also see value in
> being able to publish the full column value both before and after the
> update. This is especially true when deleting a column since this stream
> may be joined with others, or consumers may require other fields to
> properly process the delete.
>
> Additionally, there is some difficulty with processing CDC itself such as:
> - Updates not being immediately available (addressed by CASSANDRA-12148)
> - Each node providing an independent stream of updates that must be
> unified and deduplicated
>
> Our question is, what is the vision for CDC development? The current
> implementation could work for some use cases, but is a ways from a general
> streaming solution. I understand that the nature of Cassandra makes this
> quite complicated, but are there any thoughts or desires on the future
> direction of CDC?
>
> Thanks
>
>


CDC usability and future development

2018-01-30 Thread Andrew Prudhomme
Hi all,

We are currently designing a system that allows our Cassandra clusters to
produce a stream of data updates. Naturally, we have been evaluating if CDC
can aid in this endeavor. We have found several challenges in using CDC for
this purpose.

CDC provides only the mutation as opposed to the full column value, which
tends to be of limited use for us. Applications might want to know the full
column value, without having to issue a read back. We also see value in
being able to publish the full column value both before and after the
update. This is especially true when deleting a column since this stream
may be joined with others, or consumers may require other fields to
properly process the delete.

Additionally, there is some difficulty with processing CDC itself such as:
- Updates not being immediately available (addressed by CASSANDRA-12148)
- Each node providing an independent stream of updates that must be
unified and deduplicated

Our question is, what is the vision for CDC development? The current
implementation could work for some use cases, but is a ways from a general
streaming solution. I understand that the nature of Cassandra makes this
quite complicated, but are there any thoughts or desires on the future
direction of CDC?

Thanks
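For context, turning CDC on is a per-table option plus the cdc_enabled switch
in cassandra.yaml; commit log segments for CDC-flagged tables then accumulate
under the cdc_raw directory for consumers to read and clean up. A rough sketch
(Java driver assumed; keyspace/table names are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class EnableCdc {
    public static void main(String[] args) {
        // Also requires cdc_enabled: true in cassandra.yaml on the nodes.
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            session.execute("ALTER TABLE demo.events WITH cdc = true");
        }
    }
}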


RE: group by select queries

2018-01-30 Thread Modha, Digant
It was local quorum.  There’s no difference with CONSISTENCY ALL.

Consistency level set to LOCAL_QUORUM.
cassandra@cqlsh> select * from wp.position where account_id = 'user_1';

 account_id | security_id | counter | avg_exec_price | pending_quantity | quantity | transaction_id | update_time
------------+-------------+---------+----------------+------------------+----------+----------------+---------------------------------
     user_1 |        AMZN |       2 |         1239.2 |                0 |     1011 |           null | 2018-01-25 17:18:07.158000+0000
     user_1 |        AMZN |       1 |         1239.2 |                0 |     1010 |           null | 2018-01-25 17:18:07.158000+0000

(2 rows)
cassandra@cqlsh> select * from wp.position where account_id = 'user_1' group by security_id;

 account_id | security_id | counter | avg_exec_price | pending_quantity | quantity | transaction_id | update_time
------------+-------------+---------+----------------+------------------+----------+----------------+---------------------------------
     user_1 |        AMZN |       1 |         1239.2 |                0 |     1010 |           null | 2018-01-25 17:18:07.158000+0000

(1 rows)
cassandra@cqlsh> select account_id, security_id, counter, avg_exec_price, quantity, update_time from wp.position where account_id = 'user_1' group by security_id;

 account_id | security_id | counter | avg_exec_price | quantity | update_time
------------+-------------+---------+----------------+----------+---------------------------------
     user_1 |        AMZN |       2 |         1239.2 |     1011 | 2018-01-25 17:18:07.158000+0000

(1 rows)
cassandra@cqlsh> consistency all;
Consistency level set to ALL.
cassandra@cqlsh> select * from wp.position where account_id = 'user_1' group by security_id;

 account_id | security_id | counter | avg_exec_price | pending_quantity | quantity | transaction_id | update_time
------------+-------------+---------+----------------+------------------+----------+----------------+---------------------------------
     user_1 |        AMZN |       1 |         1239.2 |                0 |     1010 |           null | 2018-01-25 17:18:07.158000+0000

(1 rows)
cassandra@cqlsh> select account_id, security_id, counter, avg_exec_price, quantity, update_time from wp.position where account_id = 'user_1' group by security_id;

 account_id | security_id | counter | avg_exec_price | quantity | update_time
------------+-------------+---------+----------------+----------+---------------------------------
     user_1 |        AMZN |       2 |         1239.2 |     1011 | 2018-01-25 17:18:07.158000+0000


From: kurt greaves [mailto:k...@instaclustr.com]
Sent: Monday, January 29, 2018 11:03 PM
To: User
Subject: Re: group by select queries

What consistency were you querying at? Can you retry with CONSISTENCY ALL?


TD Securities disclaims any liability or losses either direct or consequential 
caused by the use of this information. This communication is for informational 
purposes only and is not intended as an offer or solicitation for the purchase 
or sale of any financial instrument or as an official confirmation of any 
transaction. TD Securities is neither making any investment recommendation nor 
providing any professional or advisory services relating to the activities 
described herein. All market prices, data and other information are not 
warranted as to completeness or accuracy and are subject to change without 
notice Any products described herein are (i) not insured by the FDIC, (ii) not 
a deposit or other obligation of, or guaranteed by, an insured depository 
institution and (iii) subject to investment risks, including possible loss of 
the principal amount invested. The information shall not be further distributed 
or duplicated in whole or in part by any means without the prior written 
consent of TD Securities. TD Securities is a trademark of The Toronto-Dominion 
Bank and represents TD Securities (USA) LLC and certain investment banking 
activities of The Toronto-Dominion Bank and its subsidiaries.


Not what I've expected Performance

2018-01-30 Thread Jürgen Albersdorfer
Hi, We are using C* 3.11.1 with a 9 Node Cluster built on CentOS Servers each
having 2x Quad Core Xeon, 128GB of RAM and two separate 2TB spinning Disks,
one for Log one for Data with Spark on Top.

Due to bad Schema (Partitions of about 4 to 8 GB) I need to copy a whole Table
into another having same fields but different partitioning.

I expected glowing Iron when I started the copy Job, but instead cannot even
see some Impact on CPU, mem or disks - but the Job does copy the Data over
veeerry slowly at about a MB or two per Minute.

Any suggestion where to start investigation?

Thanks already

Sent from my iPhone

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Commitlogs are filling the Full Disk space and nodes are down

2018-01-30 Thread Jeff Jirsa
There's an open bug for users that have offheap memtables and secondary
index - there's at least a few people reporting an error flushing that
blocks future flushes.

If you're seeing that, and use that combo, you may want to switch to
on-heap memtables (or contribute a patch to fix the offheap+2i interaction)

On Tue, Jan 30, 2018 at 9:06 AM, Chris Lohfink  wrote:

> The commitlog growing is often a symptom of a problem. If the memtable
> flush or post flush fails in any way, the commitlogs will not be
> recycled/deleted and will continue to pool up.
>
> Might want to go back in logs earlier to make sure there's nothing like the
> postmemtable flusher getting a permission error (some tooling creates
> commitlogs so if run by the wrong user can create this problem), or a memtable
> flush error.  You can also check tpstats to see if tasks are queued up in
> postmemtable flusher and jstack to see where the active ones are stuck if
> they are.
>
> Chris
>
> On Jan 30, 2018, at 4:20 AM, Amit Singh  wrote:
>
> Hi,
>
> When you actually say nodetool flush, data from memTable goes to disk
> based structure as SStables and side by side , commit logs segments for
> that particular data get written off and its continuous process . May be in
> your case , you can decrease the value of  below uncommented property in
> Cassandra.yaml
>
> commitlog_total_space_in_mb
>
> Also this is what is it used for
>
> # Total space to use for commit logs on disk.
> #
> # If space gets above this value, Cassandra will flush every dirty CF
> # in the oldest segment and remove it.  So a small total commitlog space
> # will tend to cause more flush activity on less-active columnfamilies.
> #
> # The default value is the smaller of 8192, and 1/4 of the total space
> # of the commitlog volume.
>
>
> *From:* Mokkapati, Bhargav (Nokia - IN/Chennai) [mailto:bhargav.mokkapati@
> nokia.com ]
> *Sent:* Tuesday, January 30, 2018 4:00 PM
> *To:* user@cassandra.apache.org
> *Subject:* Commitlogs are filling the Full Disk space and nodes are down
>
> Hi Team,
>
> My Cassandra version : Apache Cassandra 3.0.13
>
> Cassandra nodes are down due to Commitlogs are getting filled up until
> full disk size.
>
> 
>
> With “Nodetool flush” I didn’t see any commitlogs deleted.
>
> Can anyone tell me how to flush the commitlogs without losing data.
>
> Thanks,
> Bhargav M
>
>
>


Re: Commitlogs are filling the Full Disk space and nodes are down

2018-01-30 Thread Chris Lohfink
The commitlog growing is often a symptom of a problem. If the memtable flush or 
post flush fails in any way, the commitlogs will not be recycled/deleted and 
will continue to pool up.

Might want to go back in logs earlier to make sure there's nothing like the 
postmemtable flusher getting a permission error (some tooling creates 
commitlogs so if run by the wrong user can create this problem), or a memtable 
flush error.  You can also check tpstats to see if tasks are queued up in 
postmemtable flusher and jstack to see where the active ones are stuck if they 
are.

Chris

> On Jan 30, 2018, at 4:20 AM, Amit Singh  wrote:
> 
> Hi,
>  
> When you actually say nodetool flush, data from memTable goes to disk based 
> structure as SStables and side by side , commit logs segments for that 
> particular data get written off and its continuous process . May be in your 
> case , you can decrease the value of  below uncommented property in 
> Cassandra.yaml 
>  
> commitlog_total_space_in_mb
>  
> Also this is what is it used for 
>  
> # Total space to use for commit logs on disk.
> #
> # If space gets above this value, Cassandra will flush every dirty CF
> # in the oldest segment and remove it.  So a small total commitlog space
> # will tend to cause more flush activity on less-active columnfamilies.
> #
> # The default value is the smaller of 8192, and 1/4 of the total space
> # of the commitlog volume.
>  
>  
> From: Mokkapati, Bhargav (Nokia - IN/Chennai) 
> [mailto:bhargav.mokkap...@nokia.com] 
> Sent: Tuesday, January 30, 2018 4:00 PM
> To: user@cassandra.apache.org
> Subject: Commitlogs are filling the Full Disk space and nodes are down
>  
> Hi Team,
>  
> My Cassandra version : Apache Cassandra 3.0.13
>  
> Cassandra nodes are down due to Commitlogs are getting filled up until full 
> disk size.
>  
> 
>  
> With “Nodetool flush” I didn’t see any commitlogs deleted.
>  
> Can anyone tell me how to flush the commitlogs without losing data.
>  
> Thanks,
> Bhargav M



Re: Nodes show different number of tokens than initially

2018-01-30 Thread Jeff Jirsa
All DCs in a cluster use the same token space in the DHT, so token conflicts 
across datacenters are invalid config
 

-- 
Jeff Jirsa


> On Jan 29, 2018, at 11:50 PM, Oleksandr Shulgin 
>  wrote:
> 
>> On Tue, Jan 30, 2018 at 5:13 AM, kurt greaves  wrote:
>> Shouldn't happen. Can you send through nodetool ring output from one of 
>> those nodes? Also, did the logs have anything to say about tokens when you 
>> started the 3 seed nodes?​
> 
> Hi Kurt,
> 
> I cannot run nodetool ring anymore, since these test nodes are long gone.  
> However I've grepped the logs and this is what I've found:
> 
> Jan 25 08:57:18 ip-172-31-128-41 docker/cf3ea463915a[854]: INFO  08:57:18 
> Nodes /172.31.128.31 and /172.31.128.41 have the same token 
> -9223372036854775808.  Ignoring /172.31.128.31
> Jan 25 08:57:18 ip-172-31-128-41 docker/cf3ea463915a[854]: INFO  08:57:18 
> Nodes /172.31.144.32 and /172.31.128.41 have the same token 
> -8454757700450211158.  Ignoring /172.31.144.32
> Jan 25 08:58:30 ip-172-31-144-41 docker/48fba443d99f[852]: INFO  08:58:30 
> Nodes /172.31.128.41 and /172.31.128.31 have the same token 
> -9223372036854775808.  /172.31.128.41 is the new owner
> Jan 25 08:58:30 ip-172-31-144-41 docker/48fba443d99f[852]: INFO  08:58:30 
> Nodes /172.31.144.32 and /172.31.128.41 have the same token 
> -8454757700450211158.  Ignoring /172.31.144.32
> Jan 25 08:59:45 ip-172-31-160-41 docker/cced70e132f2[849]: INFO  08:59:45 
> Nodes /172.31.128.41 and /172.31.128.31 have the same token 
> -9223372036854775808.  /172.31.128.41 is the new owner
> Jan 25 08:59:45 ip-172-31-160-41 docker/cced70e132f2[849]: INFO  08:59:45 
> Nodes /172.31.144.32 and /172.31.128.41 have the same token 
> -8454757700450211158.  Ignoring /172.31.144.32
> 
> Since we are allocating the tokens for seed nodes manually, it appears that 
> the first seed node in the new ring (172.31.128.41) gets the same first token 
> (-9223372036854775808) as the node in the old ring (172.31.128.31).  The same 
> goes for the 3rd token of the new seed node (-8454757700450211158).
> 
> What is beyond me is why would that matter and why would token ownership 
> change at all, while these nodes are in the *different virtual DCs*?  To me 
> this sounds like a particularly nasty bug...
> 
> -- 
> Oleksandr "Alex" Shulgin | Database Engineer | Zalando SE | Tel: +49 176 
> 127-59-707
> 


Re: Heavy one-off writes best practices

2018-01-30 Thread Jeff Jirsa
Two other options, both of which will be faster (and less likely to impact read 
latencies) but require some app side programming, if you’re willing to generate 
the sstables programmatically with CQLSSTableWriter or similar.

Once you do that, you can:

1) stream them in with the sstableloader (which will always send them to the 
right replicas and handle renumbering the generation), or

2) manually figure out what the replicas are, rsync the files out, and call 
nodetool refresh

(If you google around you may see references to bulkSaveToCassandra, which 
seems to be DSE’s implementation of #1 - if you’re a datastax customer you 
could consider just using that, if you’re not you’ll need to recreate it using 
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
 )
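A minimal sketch of the CQLSSTableWriter part (schema, paths and values are
placeholders; the output directory should follow a keyspace/table layout so
sstableloader can stream it afterwards):

import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class OfflineSSTableWriter {
    public static void main(String[] args) throws Exception {
        // Placeholder schema and insert statement for the target table.
        String schema = "CREATE TABLE demo.events (id text, day text, payload text, "
                      + "PRIMARY KEY (id, day))";
        String insert = "INSERT INTO demo.events (id, day, payload) VALUES (?, ?, ?)";

        // Directory must already exist, laid out as <output>/demo/events for sstableloader.
        try (CQLSSTableWriter writer = CQLSSTableWriter.builder()
                                                       .inDirectory("/tmp/output/demo/events")
                                                       .forTable(schema)
                                                       .using(insert)
                                                       .build()) {
            writer.addRow("id-1", "2018-01-30", "some payload");
        }
        // Then: sstableloader -d <contact point> /tmp/output/demo/events
    }
}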



- Jeff

-- 
Jeff Jirsa


> On Jan 30, 2018, at 12:12 AM, Julien Moumne  wrote:
> 
> Hello, I am looking for best practices for the following use case :
> 
> Once a day, we insert at the same time 10 full tables (several 100GiB each) 
> using Spark C* driver, without batching, with CL set to ALL.
> 
> Whether skinny rows or wide rows, data for a partition key is always 
> completely updated / overwritten, ie. every command is an insert.
> 
> This imposes a great load on the cluster (huge CPU consumption), this load 
> greatly impacts the constant reads we have. Read latency are fine the rest of 
> the time.
> 
> Is there any best practices we should follow to ease the load when importing 
> data into C* except
>  - reducing the number of concurrent writes and throughput on the driver side
>  - reducing the number of compaction threads and throughput on the cluster
> 
> In particular : 
>  - is there any evidence that writing multiple tables at the same time 
> produces more load than writing the tables one at a time when tables are 
> completely written at once such as we do?
>  - because of the heavy writes, we use STC. Is it the best choice considering 
> data is completely overwritten once a day? Tables contain collections and 
> UDTs.
> 
> (We manage data expiration with TTL set to several days.
> We use SSDs.)
> 
> Thanks!


Re: TWCS not deleting expired sstables

2018-01-30 Thread Thakrar, Jayesh
Thanks Kurt and Kenneth.

Now if only they would work as expected.

node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>ls -lt | tail
-rw-r--r--. 1 vchadoop vchadoop   286889260 Sep 18 14:14 mc-1070-big-Index.db
-rw-r--r--. 1 vchadoop vchadoop       12236 Sep 13 20:53 mc-178-big-Statistics.db
-rw-r--r--. 1 vchadoop vchadoop          92 Sep 13 20:53 mc-178-big-TOC.txt
-rw-r--r--. 1 vchadoop vchadoop     9371211 Sep 13 20:53 mc-178-big-CompressionInfo.db
-rw-r--r--. 1 vchadoop vchadoop          10 Sep 13 20:53 mc-178-big-Digest.crc32
-rw-r--r--. 1 vchadoop vchadoop 13609890747 Sep 13 20:53 mc-178-big-Data.db
-rw-r--r--. 1 vchadoop vchadoop     1394968 Sep 13 20:53 mc-178-big-Summary.db
-rw-r--r--. 1 vchadoop vchadoop    11172592 Sep 13 20:53 mc-178-big-Filter.db
-rw-r--r--. 1 vchadoop vchadoop   190508739 Sep 13 20:53 mc-178-big-Index.db
drwxr-xr-x. 2 vchadoop vchadoop          10 Sep 12 21:47 backups

node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>sstableexpiredblockers ae raw_logs_by_user
Exception in thread "main" java.lang.IllegalArgumentException: Unknown keyspace/table ae.raw_logs_by_user
        at org.apache.cassandra.tools.SSTableExpiredBlockers.main(SSTableExpiredBlockers.java:66)

node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>sstableexpiredblockers system peers
No sstables for system.peers

node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>ls -l ../../system/peers-37f71aca7dc2383ba70672528af04d4f/
total 308
drwxr-xr-x. 2 vchadoop vchadoop 10 Sep 11 22:59 backups
-rw-rw-r--. 1 vchadoop vchadoop     83 Jan 25 02:11 mc-137-big-CompressionInfo.db
-rw-rw-r--. 1 vchadoop vchadoop 180369 Jan 25 02:11 mc-137-big-Data.db
-rw-rw-r--. 1 vchadoop vchadoop 10 Jan 25 02:11 mc-137-big-Digest.crc32
-rw-rw-r--. 1 vchadoop vchadoop 64 Jan 25 02:11 mc-137-big-Filter.db
-rw-rw-r--. 1 vchadoop vchadoop386 Jan 25 02:11 mc-137-big-Index.db
-rw-rw-r--. 1 vchadoop vchadoop   5171 Jan 25 02:11 mc-137-big-Statistics.db
-rw-rw-r--. 1 vchadoop vchadoop 56 Jan 25 02:11 mc-137-big-Summary.db
-rw-rw-r--. 1 vchadoop vchadoop 92 Jan 25 02:11 mc-137-big-TOC.txt
-rw-rw-r--. 1 vchadoop vchadoop     43 Jan 29 21:11 mc-138-big-CompressionInfo.db
-rw-rw-r--. 1 vchadoop vchadoop   9723 Jan 29 21:11 mc-138-big-Data.db
-rw-rw-r--. 1 vchadoop vchadoop 10 Jan 29 21:11 mc-138-big-Digest.crc32
-rw-rw-r--. 1 vchadoop vchadoop 16 Jan 29 21:11 mc-138-big-Filter.db
-rw-rw-r--. 1 vchadoop vchadoop 17 Jan 29 21:11 mc-138-big-Index.db
-rw-rw-r--. 1 vchadoop vchadoop   5015 Jan 29 21:11 mc-138-big-Statistics.db
-rw-rw-r--. 1 vchadoop vchadoop 56 Jan 29 21:11 mc-138-big-Summary.db
-rw-rw-r--. 1 vchadoop vchadoop 92 Jan 29 21:11 mc-138-big-TOC.txt
-rw-rw-r--. 1 vchadoop vchadoop     43 Jan 29 21:53 mc-139-big-CompressionInfo.db
-rw-rw-r--. 1 vchadoop vchadoop  18908 Jan 29 21:53 mc-139-big-Data.db
-rw-rw-r--. 1 vchadoop vchadoop 10 Jan 29 21:53 mc-139-big-Digest.crc32
-rw-rw-r--. 1 vchadoop vchadoop 16 Jan 29 21:53 mc-139-big-Filter.db
-rw-rw-r--. 1 vchadoop vchadoop 36 Jan 29 21:53 mc-139-big-Index.db
-rw-rw-r--. 1 vchadoop vchadoop   5055 Jan 29 21:53 mc-139-big-Statistics.db
-rw-rw-r--. 1 vchadoop vchadoop 56 Jan 29 21:53 mc-139-big-Summary.db
-rw-rw-r--. 1 vchadoop vchadoop 92 Jan 29 21:53 mc-139-big-TOC.txt



From: Kenneth Brotman 
Date: Tuesday, January 30, 2018 at 7:37 AM
To: 
Subject: RE: TWCS not deleting expired sstables

Wow!  It’s in the DataStax documentation: 
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/toolsSStables/toolsSStabExpiredBlockers.html

Other nice tools there as well: 
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/toolsSStables/toolsSSTableUtilitiesTOC.html

Kenneth Brotman

From: kurt greaves [mailto:k...@instaclustr.com]
Sent: Monday, January 29, 2018 8:20 PM
To: User
Subject: Re: TWCS not deleting expired sstables

Likely a read repair caused old data to be brought into a newer SSTable. Try 
running sstableexpiredblockers to find out if there's a newer SSTable blocking 
that one from being dropped.​


Re: Heavy one-off writes best practices

2018-01-30 Thread Lucas Benevides
Hello Julien,

After reading the excellent post and video by Alain Rodriguez, maybe you
should read the paper "Performance Tuning of Big Data Platform: Cassandra
Case Study" by SATHVIK KATAM. In the results he sets new values for the
memtable cleanup threshold and key cache size.
Although it is not proven that the same results will persist in different
environments, it is a good starting point.

Lucas Benevides

2018-01-30 6:12 GMT-02:00 Julien Moumne :

> Hello, I am looking for best practices for the following use case :
>
> Once a day, we insert at the same time 10 full tables (several 100GiB
> each) using Spark C* driver, without batching, with CL set to ALL.
>
> Whether skinny rows or wide rows, data for a partition key is always
> completely updated / overwritten, ie. every command is an insert.
>
> This imposes a great load on the cluster (huge CPU consumption), this load
> greatly impacts the constant reads we have. Read latency are fine the rest
> of the time.
>
> Is there any best practices we should follow to ease the load when
> importing data into C* except
>  - reducing the number of concurrent writes and throughput on the driver
> side
>  - reducing the number of compaction threads and throughput on the cluster
>
> In particular :
>  - is there any evidence that writing multiple tables at the same time
> produces more load than writing the tables one at a time when tables are
> completely written at once such as we do?
>  - because of the heavy writes, we use STC. Is it the best choice
> considering data is completely overwritten once a day? Tables contain
> collections and UDTs.
>
> (We manage data expiration with TTL set to several days.
> We use SSDs.)
>
> Thanks!
>


RE: TWCS not deleting expired sstables

2018-01-30 Thread Kenneth Brotman
Wow!  It’s in the DataStax documentation: 
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/toolsSStables/toolsSStabExpiredBlockers.html

 

Other nice tools there as well: 
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/toolsSStables/toolsSSTableUtilitiesTOC.html

 

Kenneth Brotman

 

From: kurt greaves [mailto:k...@instaclustr.com] 
Sent: Monday, January 29, 2018 8:20 PM
To: User
Subject: Re: TWCS not deleting expired sstables

 

Likely a read repair caused old data to be brought into a newer SSTable. Try 
running sstableexpiredblockers to find out if there's a newer SSTable blocking 
that one from being dropped.​



RE: Commitlogs are filling the Full Disk space and nodes are down

2018-01-30 Thread Amit Singh
Hi,

 

When you run nodetool flush, data from the memtable goes to an on-disk
structure (SSTables) and, in parallel, the commit log segments covering that
particular data get recycled; it is a continuous process. Maybe in your
case you can decrease the value of the below property (uncommenting it if
needed) in cassandra.yaml

 

commitlog_total_space_in_mb

 

Also this is what is it used for 

 

# Total space to use for commit logs on disk.

#

# If space gets above this value, Cassandra will flush every dirty CF

# in the oldest segment and remove it.  So a small total commitlog space

# will tend to cause more flush activity on less-active columnfamilies.

#

# The default value is the smaller of 8192, and 1/4 of the total space

# of the commitlog volume.

 

 

From: Mokkapati, Bhargav (Nokia - IN/Chennai)
[mailto:bhargav.mokkap...@nokia.com] 
Sent: Tuesday, January 30, 2018 4:00 PM
To: user@cassandra.apache.org
Subject: Commitlogs are filling the Full Disk space and nodes are down

 

Hi Team,

 

My Cassandra version : Apache Cassandra 3.0.13

 

Cassandra nodes are down due to Commitlogs are getting filled up until full
disk size.

 



 

With "Nodetool flush" I didn't see any commitlogs deleted.

 

Can anyone tell me how to flush the commitlogs without losing data.

 

Thanks,

Bhargav M



RE: Cassandra nodes are down

2018-01-30 Thread Amit Singh
Hello,

 

Please check the debug logs for a detailed trace; the exact reason can't be
figured out from here. Try your luck there.

 

From: Mokkapati, Bhargav (Nokia - IN/Chennai)
[mailto:bhargav.mokkap...@nokia.com] 
Sent: Monday, January 29, 2018 11:09 PM
To: user@cassandra.apache.org
Cc: mbhargavna...@gmail.com
Subject: Cassandra nodes are down

 

Hi Team,

 

I'm getting the below warnings. Please help me out to clear these issues.

 

Apache Cassandra version : 3.0.13, 5 Node cluster

 

INFO  [main] 2018-01-29 16:58:19,487 NativeLibrary.java:167 - JNA mlockall
successful

WARN  [main] 2018-01-29 16:58:19,488 StartupChecks.java:121 - jemalloc
shared library could not be preloaded to speed up memory allocations

INFO  [main] 2018-01-29 16:58:19,488 StartupChecks.java:160 - JMX is enabled
to receive remote connections on port: 8002

WARN  [main] 2018-01-29 16:58:19,488 StartupChecks.java:178 - OpenJDK is not
recommended. Please upgrade to the newest Oracle Java release

INFO  [main] 2018-01-29 16:58:19,490 SigarLibrary.java:44 - Initializing
SIGAR library

WARN  [main] 2018-01-29 16:58:19,498 SigarLibrary.java:174 - Cassandra
server running in degraded mode. Is swap disabled? : true,  Address space
adequate? : true,  nofile limit adequate? : false, nproc limit adequate? :
true

WARN  [main] 2018-01-29 16:58:19,500 StartupChecks.java:246 - Maximum number
of memory map areas per process (vm.max_map_count) 65530 is too low,
recommended value: 1048575, you can change it with sysctl.

 

WARN  [main] 2018-01-29 17:05:07,844 SystemKeyspace.java:1042 - No host ID
found, created 2dc59352-e98e-4e77-a5f2-289697e467c7 (Note: This should
happen exactly once per node).

INFO  [main] 2018-01-29 17:05:16,421 Server.java:160 - Starting listening
for CQL clients on /10.50.21.22:9042 (unencrypted)...

INFO  [main] 2018-01-29 17:05:16,449 CassandraDaemon.java:488 - Not starting
RPC server as requested. Use JMX (StorageService->startRPCServer()) or
nodetool (enablethrift) to start it

INFO  [OptionalTasks:1] 2018-01-29 17:05:18,443
CassandraRoleManager.java:350 - Created default superuser role 'cassandra'

INFO  [StorageServiceShutdownHook] 2018-01-29 17:09:55,737
HintsService.java:212 - Paused hints dispatch

INFO  [StorageServiceShutdownHook] 2018-01-29 17:09:55,740 Server.java:180 -
Stop listening for CQL clients

INFO  [StorageServiceShutdownHook] 2018-01-29 17:09:55,740
Gossiper.java:1490 - Announcing shutdown

INFO  [StorageServiceShutdownHook] 2018-01-29 17:09:55,741
StorageService.java:1991 - Node /10.50.21.22 state jump to shutdown

INFO  [StorageServiceShutdownHook] 2018-01-29 17:09:57,743
MessagingService.java:811 - Waiting for messaging service to quiesce

INFO  [ACCEPT-/10.50.21.22] 2018-01-29 17:09:57,743
MessagingService.java:1110 - MessagingService has terminated the accept()
thread

INFO  [StorageServiceShutdownHook] 2018-01-29 17:09:57,797
HintsService.java:212 - Paused hints dispatch

 

Thanks,

Bhargav M.



Commitlogs are filling the Full Disk space and nodes are down

2018-01-30 Thread Mokkapati, Bhargav (Nokia - IN/Chennai)
Hi Team,

My Cassandra version : Apache Cassandra 3.0.13

Cassandra nodes are down because the commit logs are filling up the disk to
its full size.


With "Nodetool flush" I didn't see any commitlogs deleted.

Can anyone tell me how to flush the commitlogs without losing data.

Thanks,
Bhargav M


RE: Cleanup blocking snapshots - Options?

2018-01-30 Thread Steinmaurer, Thomas
Hi Kurt,

had another try now, and yes, with 2.1.18, this constantly happens. Currently 
running nodetool cleanup on a single node in production with disabled hourly 
snapshots. SSTables with > 100G involved here. Triggering nodetool snapshot 
will result in being blocked. From an operational perspective, a bit annoying 
right now 

Have asked on https://issues.apache.org/jira/browse/CASSANDRA-13873 regarding a 
backport to 2.1, but possibly won’t get attention, cause the ticket has been 
resolved for 2.2+ already.

Regards,
Thomas

From: kurt greaves [mailto:k...@instaclustr.com]
Sent: Monday, 15 January 2018 06:18
To: User 
Subject: Re: Cleanup blocking snapshots - Options?

Disabling the snapshots is the best and only real option other than upgrading 
at the moment. Although apparently it was thought that there was only a small 
race condition in 2.1 that triggered this and it wasn't worth fixing. If you 
are triggering it easily maybe it is worth fixing in 2.1 as well. Does this 
happen consistently? Can you provide some more logs on the JIRA or better yet a 
way to reproduce?

On 14 January 2018 at 16:12, Steinmaurer, Thomas wrote:
Hello,

we are running 2.1.18 with vnodes in production and due to 
(https://issues.apache.org/jira/browse/CASSANDRA-11155) we can’t run cleanup 
e.g. after extending the cluster without blocking our hourly snapshots.

What options do we have to get rid of partitions a node does not own anymore?

• Using a version which has this issue fixed, although upgrading to 
2.2+, due to various issues, is not an option at the moment

• Temporarily disabling the hourly cron job before starting cleanup and 
re-enable after cleanup has finished

• Any other way to re-write SSTables with data a node owns after a 
cluster scale out

Thanks,
Thomas


The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a 
company registered in Linz whose registered office is at 4040 Linz, Austria, 
Freistädterstraße 313


Re: Heavy one-off writes best practices

2018-01-30 Thread Alain RODRIGUEZ
Hi Julien,

Whether skinny rows or wide rows, data for a partition key is always
> completely updated / overwritten, ie. every command is an insert.


Inserts and updates are kind of the same thing in Cassandra for standard
data types, as Cassandra appends the operation and does not actually update
any past data right away. My guess is you are actually 'updating'
existing columns, rows or partitions.

We manage data expiration with TTL set to several days.
>

I believe that for the reason mentioned above, this TTL only applies to
data that would not be overwritten. All the updated / reinserted data is
resetting the TTL timer to the new value given to the column, range, row,
or partition.
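A quick way to see that behaviour (a sketch with the Java driver; demo.daily
is a placeholder table with a text primary key pk and a text column val):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TtlResetOnOverwrite {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {
            // First load: this value would expire in 5 days (432000 s).
            session.execute("INSERT INTO demo.daily (pk, val) VALUES ('k', 'v1') USING TTL 432000");
            // Next day's overwrite: the TTL clock restarts from this newer write.
            session.execute("INSERT INTO demo.daily (pk, val) VALUES ('k', 'v2') USING TTL 432000");
            Row row = session.execute("SELECT TTL(val) FROM demo.daily WHERE pk = 'k'").one();
            System.out.println("remaining TTL in seconds: " + row.getInt(0)); // close to 432000 again
        }
    }
}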

This imposes a great load on the cluster (huge CPU consumption), this load
> greatly impacts the constant reads we have. Read latency are fine the rest
> of the time.
>

This is expected in writes heavy scenario. Writes are not touching the data
disk, thus, the CPU is often the bottleneck in this case. Also, it is known
that Spark (and similar distributed processing technologies) can harm
regular transactions.

Possible options to reduce the impact:

- Use a specific data center for analytics, within the same cluster, and
work locally there. Writes will still be replicated to the original DC
(asynchronously) but it will no longer be responsible for coordinating the
analytical jobs.
- Use a coordinator ring to delegate most of the work to this 'proxy layer'
between clients and Cassandra nodes (with data). A good starting point
could be: https://www.youtube.com/watch?v=K0sQvaxiDH0. I am not sure how
experimental or hard to deploy this architecture is, but I see there a
smart move, probably very good for some use case. Maybe yours?
- Simply limit the write speed in Spark if doable from a service
perspective or add nodes, so spark is never strong enough to break regular
transactions (this could be very expensive).
- Run Spark mostly on off-peak hours
- ... probably some more I cannot think of just now :).

Is there any best practices we should follow to ease the load when
> importing data into C* except
>  - reducing the number of concurrent writes and throughput on the driver
> side
>

Yes, as mentioned above, on throttling spark throughput if doable is
definitively a good idea. If not you might have terrible surprises if
someone from the dev team decides to add some more writes suddenly and
Cassandra side is not ready for it.

 - reducing the number of compaction threads and throughput on the cluster
>

Generally the number of compaction threads is well defined by default. You don't
want to use more than 1/4 or 1/2 of the total cores available and generally no
more than 8. Lowering the compaction throughput is a double-edged sword.
Yes it would free some disk throughput immediately. Yet if compactions are
stacking, SSTables are merging slowly and reads performances will decrease
substantially, quite fast, as each read will have to hit a lot of files
thus making an increasing number of reads. The throughput should be set to
a value that is fast enough to keep up with compactions.

If you really have to rewrite 100% of the data, every day, I would suggest
you to create 10 new tables every day instead of rewriting existing data.
Writing a new table 'MyAwesomeTable-20180130' for example and then simply
dropping the one from 2 or 3 days ago and cleaning the snapshot, might be
more efficient I would say. On the client side, it is about adding the date
(passed or calculated).
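A sketch of that rotation (table and keyspace names are placeholders; the
application would build the dated name the same way when reading):

import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class DailyTableRotation {
    public static void main(String[] args) {
        DateTimeFormatter fmt = DateTimeFormatter.ofPattern("yyyyMMdd");
        String today = LocalDate.now().format(fmt);
        String stale = LocalDate.now().minusDays(3).format(fmt);
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("demo")) {
            // Today's bulk load goes into a brand new dated table instead of overwriting in place.
            session.execute("CREATE TABLE IF NOT EXISTS my_awesome_table_" + today
                          + " (pk text PRIMARY KEY, val text)");
            // Drop the table from a few days ago; clear its snapshot afterwards
            // (nodetool clearsnapshot) to actually reclaim the disk space.
            session.execute("DROP TABLE IF EXISTS my_awesome_table_" + stale);
        }
    }
}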

In particular :
>  - is there any evidence that writing multiple tables at the same time
> produces more load than writing the tables one at a time when tables are
> completely written at once such as we do?


I don't think so, except maybe that compactions within a single table
cannot be done all in parallel, thus you would probably limit the load a
bit in Cassandra. I am not even sure, a lot of progress was made in the
past to make compactions more efficient :).

 - because of the heavy writes, we use STC. Is it the best choice
> considering data is completely overwritten once a day? Tables contain
> collections and UDTs.


STCS sounds reasonable, I would not start tuning here. TWCS could be
considered to evict tombstones efficiently, but as I said earlier, I don't
think you have a lot of expired tombstones. I would guess compactions +
coordination for writes are the cluster killer in your case, but
please, let us know how compactions and tombstones look in your
cluster.

- compactions: nodetool compactionstats -H / check pending compactions
- tombstones: use sstablemetadata on biggest / oldest SSTables or use
monitoring to check the ratio of droppable tombstones.

It's a very specific use case I never faced and I don't know your exact use
case, so I am mostly guessing here. I can be wrong on some of the points
above, but I am sure some people around will step in and correct me where
this is the case :). I ho

Re: Heavy one-off writes best practices

2018-01-30 Thread Alain RODRIGUEZ
I noticed I did not give the credits to Eric Lubow from SimpleReach. The
video mentioned above is a talk he gave at the Cassandra Summit 2016 :-).

2018-01-30 9:07 GMT+00:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi Julien,
>
> Whether skinny rows or wide rows, data for a partition key is always
>> completely updated / overwritten, ie. every command is an insert.
>
>
> Insert and updates are kind of the same thing in Cassandra for standard
> data types, as Cassandra appends the operation and does not actually update
> any past data right away. My guess is you are actually  'updating'
> existing columns, rows or partitions.
>
> We manage data expiration with TTL set to several days.
>>
>
> I believe that for the reason mentioned above, this TTL only applies to
> data that would not be overwritten. All the updated / reinserted data, is
> resetting the TTL timer to the new value given to the column, range, row,
> or partition.
>
> This imposes a great load on the cluster (huge CPU consumption), this load
>> greatly impacts the constant reads we have. Read latency are fine the rest
>> of the time.
>>
>
> This is expected in writes heavy scenario. Writes are not touching the
> data disk, thus, the CPU is often the bottleneck in this case. Also, it is
> known that Spark (and similar distributed processing technologies) can harm
> regular transactions.
>
> Possible options to reduce the impact:
>
> - Use a specific data center for analytics, within the same cluster, and
> work locally there. Writes will still be replicated to the original DC
> (asynchronously) but it will no longer be responsible for coordinating the
> analytical jobs.
> - Use a coordinator ring to delegate most of the work to this 'proxy
> layer' between clients and Cassandra nodes (with data). A good starting
> point could be: https://www.youtube.com/watch?v=K0sQvaxiDH0. I am not
> sure how experimental or hard to deploy this architecture is, but I see
> there a smart move, probably very good for some use case. Maybe yours?
> - Simply limit the write speed in Spark if doable from a service
> perspective or add nodes, so spark is never strong enough to break regular
> transactions (this could be very expensive).
> - Run Spark mostly on off-peak hours
> - ... probably some more I cannot think of just now :).
>
> Is there any best practices we should follow to ease the load when
>> importing data into C* except
>>  - reducing the number of concurrent writes and throughput on the driver
>> side
>>
>
> Yes, as mentioned above, on throttling spark throughput if doable is
> definitively a good idea. If not you might have terrible surprises if
> someone from the dev team decides to add some more writes suddenly and
> Cassandra side is not ready for it.
>
>  - reducing the number of compaction threads and throughput on the cluster
>>
>
> Generally the number of compaction is well defined by default. You don't
> want to use more than 1/4 or 1/2 of the total available and generally no
> more than 8. Lowering the compaction throughput is a double-edged sword.
> Yes it would free some disk throughput immediately. Yet if compactions are
> stacking, SSTables are merging slowly and reads performances will decrease
> substantially, quite fast, as each read will have to hit a lot of files
> thus making an increasing number of reads. The throughput should be set to
> a value that is fast enough to keep up with compactions.
>
> If you really have to rewrite 100% of the data, every day, I would suggest
> you to create 10 new tables every day instead of rewriting existing data.
> Writing a new table 'MyAwesomeTable-20180130' for example and then simply
> dropping the one from 2 or 3 days ago and cleaning the snapshot, might be
> more efficient I would say. On the client side, it is about adding the date
> (passed or calculated).
>
> In particular :
>>  - is there any evidence that writing multiple tables at the same time
>> produces more load than writing the tables one at a time when tables are
>> completely written at once such as we do?
>
>
> I don't think so, except maybe that compactions within a single table
> cannot be done all in parallel, thus you would probably limit the load a
> bit in Cassandra. I am not even sure, a lot of progress was made in the
> past to make compactions more efficient :).
>
>  - because of the heavy writes, we use STC. Is it the best choice
>> considering data is completely overwritten once a day? Tables contain
>> collections and UDTs.
>
>
> STCS sounds reasonable, I would not start tuning here. TWCS could be
> considered to evict tombstones efficiently, but as I said earlier, I don't
> th

Heavy one-off writes best practices

2018-01-30 Thread Julien Moumne
Hello, I am looking for best practices for the following use case :

Once a day, we insert at the same time 10 full tables (several 100GiB each)
using Spark C* driver, without batching, with CL set to ALL.

Whether skinny rows or wide rows, data for a partition key is always
completely updated / overwritten, ie. every command is an insert.

This imposes a great load on the cluster (huge CPU consumption), this load
greatly impacts the constant reads we have. Read latency are fine the rest
of the time.

Is there any best practices we should follow to ease the load when
importing data into C* except
 - reducing the number of concurrent writes and throughput on the driver
side
 - reducing the number of compaction threads and throughput on the cluster

In particular :
 - is there any evidence that writing multiple tables at the same time
produces more load than writing the tables one at a time when tables are
completely written at once such as we do?
 - because of the heavy writes, we use STC. Is it the best choice
considering data is completely overwritten once a day? Tables contain
collections and UDTs.

(We manage data expiration with TTL set to several days.
We use SSDs.)

Thanks!