Re: vnodes: high availability

2018-01-16 Thread kurt greaves
Even with a low number of vnodes you're asking for a bad time. Even if you
managed to get down to 2 vnodes per node, you're still likely to include
double the number of nodes in any streaming/repair operation, which will
likely be very problematic for incremental repairs, and you still won't be
able to easily reason about which nodes are responsible for which token
ranges. It's still quite likely that a loss of 2 nodes would mean some
portion of the ring is down (at QUORUM). At the moment I'd say steer clear
of vnodes and use single tokens if you can; a lot of work still needs to be
done to ensure smooth operation of C* while using vnodes, and they are much
more difficult to reason about (which is probably the reason no one has
bothered to do the math). If you're really keen on the math, your best bet
is to do it yourself, because it's not a point of interest for many C* devs,
and a lot of us probably wouldn't remember enough math to know how to
approach it.

If you want to get out of this situation you'll need to do a DC migration
to a new DC with a better configuration of snitch/replication
strategy/racks/tokens.
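
For reference, a minimal sketch of what such a target setup could look like.
Keyspace, DC and RF values below are placeholders, and it assumes a rack-aware
snitch (e.g. GossipingPropertyFileSnitch or Ec2Snitch) with one rack per AZ:

    -- RF values and names are examples only
    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'dc_old': 3,
                          'dc_new': 3};

With NetworkTopologyStrategy and one rack per AZ, replicas are spread across
distinct racks, which is what lets you lose a whole AZ and still answer at
QUORUM with RF=3 (the point Jon makes later in this thread).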


On 16 January 2018 at 21:54, Kyrylo Lebediev 
wrote:

> Thank you for this valuable info, Jon.
> I guess both you and Alex are referring to improved vnodes allocation
> method  https://issues.apache.org/jira/browse/CASSANDRA-7032 which was
> implemented in 3.0.
>
> Based on your info and comments in the ticket it's really a bad idea to
> have small number of vnodes for the versions using old allocation method
> because of hot-spots, so it's not an option for my particular case (v.2.1)
> :(
>
> [As far as I can see from the source code this new method
> wasn't backported to 2.1.]
>
>
>
> Regards,
>
> Kyrill
> [CASSANDRA-7032] Improve vnode allocation - ASF JIRA
> 
> issues.apache.org
> It's been known for a little while that random vnode allocation causes
> hotspots of ownership. It should be possible to improve dramatically on
> this with deterministic ...
>
> --
> *From:* Jon Haddad  on behalf of Jon Haddad <
> j...@jonhaddad.com>
> *Sent:* Tuesday, January 16, 2018 8:21:33 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the
> randomness.  There’s going to be some imbalance, the amount of imbalance
> depends on luck, unfortunately.
>
> I’m interested to hear your results using 4 tokens, would you mind letting
> the ML know your experience when you’ve done it?
>
> Jon
>
> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev 
> wrote:
>
> Agree with you, Jon.
> Actually, this cluster was configured by my 'predecessor' and [fortunately
> for him] we've never met :)
> We're using version 2.1.15 and can't upgrade because of legacy Netflix
> Astyanax client used.
>
> Below in the thread Alex mentioned that it's recommended to set vnodes to
> a value lower than 256 only for C* version > 3.0 (token allocation
> algorithm was improved since C* 3.0) .
>
> Jon,
> Do you have positive experience setting up  cluster with vnodes < 256 for
> C* 2.1?
>
> vnodes=32 also too high, as for me (we need to have much more than 32
> servers per AZ in order to get 'reliable' cluster)
> vnodes=4 seems to be better from HA + balancing trade-off
>
> Thanks,
> Kyrill
> --
> *From:* Jon Haddad  on behalf of Jon Haddad <
> j...@jonhaddad.com>
> *Sent:* Tuesday, January 16, 2018 6:44:53 PM
> *To:* user
> *Subject:* Re: vnodes: high availability
>
> While all the token math is helpful, I have to also call out the elephant
> in the room:
>
> You have not correctly configured Cassandra for production.
>
> If you had used the correct endpoint snitch & network topology strategy,
> you would be able to withstand the complete failure of an entire
> availability zone at QUORUM, or two if you queried at CL=ONE.
>
> You are correct about 256 tokens causing issues, it’s one of the reasons
> why we recommend 32.  I’m curious how things behave going as low as 4,
> personally, but I haven’t done the math / tested it yet.
>
>
>
> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev 
> wrote:
>
> ...to me it sounds like 'C* isn't that highly-available by design as it's
> declared'.
> More nodes in a cluster means higher probability of simultaneous node
> failures.
> And from high-availability standpoint, looks like situation is made even
> worse by recommended setting vnodes=256.
>
> Need to do some math to get numbers/formulas, but now situation doesn't
> seem to be promising.
> In case smb from C* developers/architects is reading this message, I'd be
> grateful to get some links to calculations of C* reliability based on which
> decisions were made.
>
> Regards,
> Kyrill
> --
> *From:* 

Re: New token allocation and adding a new DC

2018-01-16 Thread kurt greaves
I believe you are able to get away with just altering the keyspace to
include both DC's even before the DC exists, and then adding your nodes to
that new DC using the algorithm. Note you'll probably want to take the
opportunity to reduce the number of vnodes to something reasonable. Based
on memory from previous testing, you can get a good token balance with 16
vnodes if you have at least 6 nodes per rack (with RF=3 and 3 racks).
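
A rough sketch of that setup, with made-up keyspace/DC names and example RF
values (adjust to your own cluster):

    # cassandra.yaml fragment for each node joining the new DC
    num_tokens: 16
    allocate_tokens_for_keyspace: my_keyspace   # the 3.0+ allocation algorithm
    auto_bootstrap: false                       # stream later via rebuild

    -- run once, before starting the new nodes, so the allocator has
    -- ownership in the new DC to optimise for
    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'dc_old': 3, 'dc_new': 3};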


On 16 January 2018 at 16:02, Oleksandr Shulgin  wrote:

> On Tue, Jan 16, 2018 at 4:16 PM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Oleksandr,
>>
>> if bootstrap is disabled, it will only skip the streaming phase but will
>> still go through token allocation and thus should use the new algorithm.
>> The algorithm won't try to spread data based on size on disk but it will
>> try to spread token ownership as evenly as possible.
>>
>> The problem you'll run into is that ownership for a specific keyspace
>> will be null as long as the replication strategy isn't updated to create
>> replicas on the new DC.
>> Quickly thinking would make me do the following :
>>
>>- Create enough nodes in the new DC to match the target replication
>>factor
>>- Alter the replication strategy to add the target number of replicas
>>in the new DC (they will start getting writes, and hopefully you've 
>> already
>>segregated reads)
>>- Continue adding nodes in the new DC (with auto_bootstrap = false),
>>specifying the right keyspace to optimize token allocations
>>- Run rebuild on all nodes in the new DC
>>
>> I honestly never used it but that's my understanding of how it should
>> work.
>>
>
> Oh, that's neat.  We will try this and see if it helps.
>
> Thank you!
> --
> Alex
>
>


RE: Slender Cassandra Cluster Project

2018-01-16 Thread Kenneth Brotman
Sure.  That takes the project from awesome to 10X awesome.  I absolutely would 
be willing to do that.  Thanks Kurt!

 

Regarding your comment on the keyspaces, I agree.  There should be a few simple 
examples one way or the other that can be duplicated and observed, and then an 
example to duplicate and play with that has a nice real world mix, with some 
keyspaces that replicate over only a subset of DC’s and some that replicate to 
all DC’s.

 

Kenneth Brotman 

 

From: kurt greaves [mailto:k...@instaclustr.com] 
Sent: Tuesday, January 16, 2018 1:31 PM
To: User
Subject: Re: Slender Cassandra Cluster Project

 

Sounds like a great idea. Probably would be valuable to add to the official 
docs as an example set up if you're willing.

 

Only thing I'd add is that you should have keyspaces that replicate over only a 
subset of DC's, plus one/some replicated to all DC's

 

On 17 Jan. 2018 03:26, "Kenneth Brotman"  wrote:

I’ve begun working on a reference project intended to provide guidance on 
configuring and operating a modest Cassandra cluster of about 18 nodes suitable 
for the economic study, demonstration, experimentation and testing of a 
Cassandra cluster.

 

The slender cluster would be designed to be as inexpensive as possible while 
still using real world hardware in order to lower the cost to those with 
limited initial resources. Sorry no Raspberry Pi’s for this project.  

 

There would be an on-premises version and a cloud version.  Guidance would be 
provided on configuring the cluster, on demonstrating key Cassandra behaviors, 
on files sizes, capacity to use with the Slender Cassandra Cluster, and so on.

 

Why about eighteen nodes? I tried to figure out the minimum number of 
nodes needed for Cassandra to be Cassandra.  Here were my considerations:

 

• A user wouldn’t run Cassandra in just one data center; so at 
least two datacenters.

• A user probably would want a third data center available for 
analytics.

• There needs to be enough nodes for enough parallelism to observe 
Cassandra’s distributed nature.

• The cluster should have enough nodes that one gets a sense of the 
need for cluster wide management tools to do things like repairs, snapshots and 
cluster monitoring.

• The cluster should be able to demonstrate a RF=3 with local 
quorum.  If replicated in all three data centers, one write would impact half 
the 18 nodes, 3 datacenters X 3 nodes per data center = 9 nodes of 18 nodes.  
If replicated in two of the data centers, one write would still impact one 
third of the 18 nodes, 2 DC’s X 3 nodes per DC = 6 of the 18 nodes.  

 

So eighteen seems like the minimum number of nodes needed.  That’s six nodes in 
each of three data centers.

 

Before I get too carried away with this project, I’m looking for some feedback 
on whether this project would indeed be helpful to others? Also, should the 
project be changed in any way?

 

It’s always a pleasure to connect with the Cassandra users’ community.  Thanks 
for all the hard work, the expertise, the civil dialog.

 

Kenneth Brotman



Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
Thank you for this valuable info, Jon.
I guess both you and Alex are referring to improved vnodes allocation method  
https://issues.apache.org/jira/browse/CASSANDRA-7032 which was implemented in 
3.0.

Based on your info and comments in the ticket it's really a bad idea to have 
small number of vnodes for the versions using old allocation method because of 
hot-spots, so it's not an option for my particular case (v.2.1) :(

[As far as I can see from the source code this new method wasn't backported to 
2.1.]



Regards,

Kyrill

[CASSANDRA-7032] Improve vnode allocation - ASF 
JIRA
issues.apache.org
It's been known for a little while that random vnode allocation causes hotspots 
of ownership. It should be possible to improve dramatically on this with 
deterministic ...



From: Jon Haddad  on behalf of Jon Haddad 

Sent: Tuesday, January 16, 2018 8:21:33 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  
There’s going to be some imbalance, the amount of imbalance depends on luck, 
unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the 
ML know your experience when you’ve done it?

Jon

On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev 
> wrote:

Agree with you, Jon.
Actually, this cluster was configured by my 'predecessor' and [fortunately for 
him] we've never met :)
We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax 
client used.

Below in the thread Alex mentioned that it's recommended to set vnodes to a 
value lower than 256 only for C* version > 3.0 (token allocation algorithm was 
improved since C* 3.0) .

Jon,
Do you have positive experience setting up  cluster with vnodes < 256 for  C* 
2.1?

vnodes=32 also too high, as for me (we need to have much more than 32 servers 
per AZ in order to get 'reliable' cluster)
vnodes=4 seems to be better from HA + balancing trade-off

Thanks,
Kyrill

From: Jon Haddad > 
on behalf of Jon Haddad >
Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in 
the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you 
would be able to withstand the complete failure of an entire availability zone 
at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we 
recommend 32.  I’m curious how things behave going as low as 4, personally, but 
I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev 
> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.
In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.

Regards,
Kyrill

From: kurt greaves >
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into 

Re: Slender Cassandra Cluster Project

2018-01-16 Thread kurt greaves
Sounds like a great idea. Probably would be valuable to add to the official
docs as an example set up if you're willing.

Only thing I'd add is that you should have keyspaces that replicate over
only a subset of DC's, plus one/some replicated to all DC's
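
For instance, with placeholder DC names, the demo schema could carry both
flavours:

    -- replicated to every DC in the cluster
    CREATE KEYSPACE demo_global
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'dc1': 3, 'dc2': 3, 'dc3': 3};

    -- replicated to a subset of the DCs only
    CREATE KEYSPACE demo_regional
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'dc1': 3, 'dc2': 3};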

On 17 Jan. 2018 03:26, "Kenneth Brotman" 
wrote:

> I’ve begun working on a reference project intended to provide guidance on
> configuring and operating a modest Cassandra cluster of about 18 nodes
> suitable for the economic study, demonstration, experimentation and testing
> of a Cassandra cluster.
>
>
>
> The slender cluster would be designed to be as inexpensive as possible
> while still using real world hardware in order to lower the cost to those
> with limited initial resources. Sorry no Raspberry Pi’s for this project.
>
>
>
> There would be an on-premises version and a cloud version.  Guidance would
> be provided on configuring the cluster, on demonstrating key Cassandra
> behaviors, on files sizes, capacity to use with the Slender Cassandra
> Cluster, and so on.
>
>
>
> Why about eighteen nodes? I tried to figure out the minimum number of
> nodes needed for Cassandra to be Cassandra.  Here were my considerations:
>
>
>
> • A user wouldn’t run Cassandra in just one data center; so at
> least two datacenters.
>
> • A user probably would want a third data center available for
> analytics.
>
> • There needs to be enough nodes for enough parallelism to
> observe Cassandra’s distributed nature.
>
> • The cluster should have enough nodes that one gets a sense
> of the need for cluster wide management tools to do things like repairs,
> snapshots and cluster monitoring.
>
> • The cluster should be able to demonstrate a RF=3 with local
> quorum.  If replicated in all three data centers, one write would impact
> half the 18 nodes, 3 datacenters X 3 nodes per data center = 9 nodes of 18
> nodes.  If replicated in two of the data centers, one write would still
> impact one third of the 18 nodes, 2 DC’s X 3 nodes per DC = 6 of the 18
> nodes.
>
>
>
> So eighteen seems like the minimum number of nodes needed.  That’s six
> nodes in each of three data centers.
>
>
>
> Before I get too carried away with this project, I’m looking for some
> feedback on whether this project would indeed be helpful to others? Also,
> should the project be changed in any way?
>
>
>
> It’s always a pleasure to connect with the Cassandra users’ community.
> Thanks for all the hard work, the expertise, the civil dialog.
>
>
>
> Kenneth Brotman
>


Re: Is it recommended to enable debug log in production

2018-01-16 Thread Jon Haddad
In certain versions (2.2 specifically) I’ve seen a massive performance hit from 
the extra logging in some very specific circumstances.  In the case I looked at 
it was due to the added overhead of reflection.  The issue I found was resolved 
in 3.0 (I think), but I always disable DEBUG logging now anyways, just in case. 
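
If you do want it off, a couple of options — the logger names below are taken
from the default 3.x config, so double-check them against your own logback.xml:

    # change at runtime (reverts on restart)
    nodetool setlogginglevel org.apache.cassandra INFO

    # or just silence the noisy gossip class from the message below
    nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector INFO

    <!-- or persistently in conf/logback.xml, change the DEBUG logger to: -->
    <logger name="org.apache.cassandra" level="INFO"/>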

> On Jan 16, 2018, at 11:01 AM, Jay Zhuang  wrote:
> 
> Hi,
> 
> Do you guys enable debug log in production? Is it recommended?
> 
> By default, the cassandra log level is set to debug:
> https://github.com/apache/cassandra/blob/trunk/conf/logback.xml#L100
> 
> We’re using 3.0.x, which generates lots of Gossip messages:
> FailureDetector.java:456 - Ignoring interval time of 2001193771 for /IP
> 
> Probably we should back port 
> https://github.com/apache/cassandra/commit/9ac01baef5c8f689e96307da9b29314bc0672462
> Other than that, do you guys see any other issue?
> 
> Thanks,
> Jay
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Is it recommended to enable debug log in production

2018-01-16 Thread Jay Zhuang
Hi,

Do you guys enable debug log in production? Is it recommended?

By default, the cassandra log level is set to debug:
https://github.com/apache/cassandra/blob/trunk/conf/logback.xml#L100

We’re using 3.0.x, which generates lots of Gossip messages:
FailureDetector.java:456 - Ignoring interval time of 2001193771 for /IP

Probably we should back port 
https://github.com/apache/cassandra/commit/9ac01baef5c8f689e96307da9b29314bc0672462
Other than that, do you guys see any other issue?

Thanks,
Jay
-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: TWCS and autocompaction

2018-01-16 Thread Alexander Dejanovski
The ticket I was referring to is the following :
https://issues.apache.org/jira/browse/CASSANDRA-13418

It's been merged in 3.11.1, so just make sure you enable
unsafe_aggressive_sstable_expiration and you'll evict expired SSTables
regardless of overlaps (and IMHO it's totally safe to do this).
Do not ever run major compactions on TWCS tables unless you have a really,
really valid reason, and do not ever disable autocompaction on any table
for a long time.

Foreground read repair will still happen, regardless your settings, when
reading at QUORUM or LOCAL_QUORUM, that's just part of the read path.
read_repair_chance and dc_read_repair_chance set to 0.0 will only disable
background read repair, which also happens at other consistency levels.

Currently, you have a default TTL of 1555200 seconds and a 4-hour time window,
which can create up to 108 live buckets.
The advice Jeff Jirsa gave back in the day is to try to keep the number of
live buckets between 50 and 60, which means you should double the size of
your time windows to 8 hours.

If you end up with 100 SSTables, then TWCS is properly doing its work,
keeping in mind that the current time window can/will have more than one
SSTable. Major compaction within a bucket will happen once it gets out of
the current time window.
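
Concretely, something along these lines — table name is a placeholder, only
the compaction options are shown (keep your other table settings), and
depending on the exact build the unsafe option may additionally need to be
allowed via a JVM system property, see CASSANDRA-13418:

    ALTER TABLE my_ks.my_table
      WITH compaction = {
        'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': '8',
        'unchecked_tombstone_compaction': 'true',
        'unsafe_aggressive_sstable_expiration': 'true'};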

Cheers,


On Tue, Jan 16, 2018 at 7:16 PM Cogumelos Maravilha <
cogumelosmaravi...@sapo.pt> wrote:

> Hi,
>
> My read_repair_chance is 0 (AND read_repair_chance = 0.0)
>
> When I bootstrap a new node there is around 700 sstables, but after auto
> compaction the number drop to around 100.
>
> I'm using C* 3.11.1. To solve the problem I've already changed to
> 'unchecked_tombstone_compaction': 'true'. Now should I run nodetool compact?
>
> And for the future crontab nodetool disableautocompaction?
>
> Thanks
>
> On 16-01-2018 11:35, Alexander Dejanovski wrote:
>
> Hi,
>
> The overlaps you're seeing on time windows aren't due to automatic
> compactions, but to read repairs.
> You must be reading at quorum or local_quorum which can perform foreground
> read repair in case of digest mismatch.
>
> You can set unchecked_tombstone_compaction to true if you want to perform
> single sstable compaction to purge tombstones and a patch has recently been
> merged in to allow twcs to delete fully expired data even in case of
> overlap between time windows (I can't remember if it's been merged in
> 3.11.1).
> Just so you know, the timestamp considered for time windows is the max
> timestamp. You can have old data in recent time windows, but not the
> opposite.
>
> Cheers,
>
> On Tue, Jan 16, 2018 at 12:07, Cogumelos Maravilha <
> cogumelosmaravi...@sapo.pt> wrote:
>
>> Hi list,
>>
>> My settings:
>>
>> AND compaction = {'class':
>> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
>> 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
>> 'enabled': 'true', 'max_threshold': '64', 'min_threshold': '2',
>> 'tombstone_compaction_interval': '15000', 'tombstone_threshold': '0.2',
>> 'unchecked_tombstone_compaction': 'false'}
>> AND compression = {'chunk_length_in_kb': '64', 'class':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>> AND crc_check_chance = 0.0
>> AND dclocal_read_repair_chance = 0.0
>> AND default_time_to_live = 1555200
>> AND gc_grace_seconds = 10800
>> AND max_index_interval = 2048
>> AND memtable_flush_period_in_ms = 0
>> AND min_index_interval = 128
>> AND read_repair_chance = 0.0
>> AND speculative_retry = '99PERCENTILE';
>>
>> Running this script:
>>
>> for f in *Data.db; do
>>ls -lrt $f
>>output=$(sstablemetadata $f 2>/dev/null)
>>    max=$(echo "$output" | grep Maximum\ timestamp | cut -d" " -f3 | cut -c 1-10)
>>    min=$(echo "$output" | grep Minimum\ timestamp | cut -d" " -f3 | cut -c 1-10)
>>date -d @$max +'%d/%m/%Y %H:%M:%S'
>>date -d @$min +'%d/%m/%Y %H:%M:%S'
>> done
>>
>> on sstables I'm getting values like these:
>>
>> -rw-r--r-- 1 cassandra cassandra 12137573577 Jan 14 20:08
>> mc-22750-big-Data.db
>> 14/01/2018 19:57:41
>> 31/12/2017 19:06:48
>>
>> -rw-r--r-- 1 cassandra cassandra 4669422106 Jan 14 06:55
>> mc-22322-big-Data.db
>> 12/01/2018 07:59:57
>> 28/12/2017 19:08:42
>>
>> My goal is using TWCS for sstables expired fast because lots of new data
>> is coming in. What is the best approach to archive that? Should I
>> disable auto compaction?
>> Thanks in advance.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: vnodes: high availability

2018-01-16 Thread Jon Haddad
We’ve used 32 tokens pre 3.0.  It’s been a mixed result due to the randomness.  
There’s going to be some imbalance, the amount of imbalance depends on luck, 
unfortunately.

I’m interested to hear your results using 4 tokens, would you mind letting the 
ML know your experience when you’ve done it?

Jon

> On Jan 16, 2018, at 9:40 AM, Kyrylo Lebediev  wrote:
> 
> Agree with you, Jon.
> Actually, this cluster was configured by my 'predecessor' and [fortunately 
> for him] we've never met :)
> We're using version 2.1.15 and can't upgrade because of legacy Netflix 
> Astyanax client used.
> 
> Below in the thread Alex mentioned that it's recommended to set vnodes to a 
> value lower than 256 only for C* version > 3.0 (token allocation algorithm 
> was improved since C* 3.0) .
> 
> Jon,  
> Do you have positive experience setting up  cluster with vnodes < 256 for  C* 
> 2.1? 
> 
> vnodes=32 also too high, as for me (we need to have much more than 32 servers 
> per AZ in order to get 'reliable' cluster)
> vnodes=4 seems to be better from HA + balancing trade-off
> 
> Thanks, 
> Kyrill
> From: Jon Haddad  on behalf of Jon Haddad 
> 
> Sent: Tuesday, January 16, 2018 6:44:53 PM
> To: user
> Subject: Re: vnodes: high availability
>  
> While all the token math is helpful, I have to also call out the elephant in 
> the room:
> 
> You have not correctly configured Cassandra for production.
> 
> If you had used the correct endpoint snitch & network topology strategy, you 
> would be able to withstand the complete failure of an entire availability 
> zone at QUORUM, or two if you queried at CL=ONE. 
> 
> You are correct about 256 tokens causing issues, it’s one of the reasons why 
> we recommend 32.  I’m curious how things behave going as low as 4, 
> personally, but I haven’t done the math / tested it yet.
> 
> 
> 
>> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev > > wrote:
>> 
>> ...to me it sounds like 'C* isn't that highly-available by design as it's 
>> declared'.
>> More nodes in a cluster means higher probability of simultaneous node 
>> failures.
>> And from high-availability standpoint, looks like situation is made even 
>> worse by recommended setting vnodes=256.
>> 
>> Need to do some math to get numbers/formulas, but now situation doesn't seem 
>> to be promising.
>> In case smb from C* developers/architects is reading this message, I'd be 
>> grateful to get some links to calculations of C* reliability based on which 
>> decisions were made.  
>> 
>> Regards, 
>> Kyrill
>> From: kurt greaves >
>> Sent: Tuesday, January 16, 2018 2:16:34 AM
>> To: User
>> Subject: Re: vnodes: high availability
>>  
>> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
>> intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>> 
>> If node A goes down all 256 ranges will go down, and considering there are 
>> only 49 other nodes all with 256 vnodes each, it's very likely that every 
>> node will be responsible for some range A was also responsible for. I'm not 
>> sure what the exact math is, but think of it this way: If on each node, any 
>> of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 
>> token ranges) on the ring with a token range on node A those token ranges 
>> will be down at QUORUM. 
>> 
>> Because token range assignment just uses rand() under the hood, I'm sure you 
>> could prove that it's always going to be the case that any 2 nodes going 
>> down result in a loss of QUORUM for some token range.
>> 
>> On 15 January 2018 at 19:59, Kyrylo Lebediev > > wrote:
>> Thanks Alexander!
>> 
>> I'm not a MS in math too) Unfortunately.
>> 
>> Not sure, but it seems to me that probability of 2/49 in your explanation 
>> doesn't take into account that vnodes endpoints are almost evenly 
>> distributed across all nodes (at least it's what I can see from "nodetool 
>> ring" output).
>> 
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>  
>> 
>> of course this vnodes illustration is a theoretical one, but there no 2 
>> nodes on that diagram that can be switched off without losing a key range 
>> (at CL=QUORUM). 
>> 
>> That's because vnodes_per_node=8 > Nnodes=6.
>> As far as I understand, situation is getting worse with increase of 
>> vnodes_per_node/Nnode ratio.
>> Please, correct me if I'm wrong.
>> 
>> How would the situation differ from this example by DataStax, if we had a 
>> real-life 6-nodes cluster with 8 vnodes on each node? 
>> 
>> Regards, 
>> Kyrill
>> 
>> From: Alexander Dejanovski > 

Re: TWCS and autocompaction

2018-01-16 Thread Cogumelos Maravilha
Hi,

My read_repair_chance is 0 (AND read_repair_chance = 0.0)

When I bootstrap a new node there is around 700 sstables, but after auto
compaction the number drop to around 100.

I'm using C* 3.11.1. To solve the problem I've already changed to
'unchecked_tombstone_compaction': 'true'. Now should I run nodetool compact?

And for the future crontab nodetool disableautocompaction?

Thanks


On 16-01-2018 11:35, Alexander Dejanovski wrote:
>
> Hi,
>
> The overlaps you're seeing on time windows aren't due to automatic
> compactions, but to read repairs.
> You must be reading at quorum or local_quorum which can perform
> foreground read repair in case of digest mismatch.
>
> You can set unchecked_tombstone_compaction to true if you want to
> perform single sstable compaction to purge tombstones and a patch has
> recently been merged in to allow twcs to delete fully expired data
> even in case of overlap between time windows (I can't remember if it's
> been merged in 3.11.1).
> Just so you know, the timestamp considered for time windows is the max
> timestamp. You can have old data in recent time windows, but not the
> opposite.
>
> Cheers,
>
>
> On Tue, Jan 16, 2018 at 12:07, Cogumelos Maravilha
> > wrote:
>
> Hi list,
>
> My settings:
>
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
> 'enabled': 'true', 'max_threshold': '64', 'min_threshold': '2',
> 'tombstone_compaction_interval': '15000', 'tombstone_threshold':
> '0.2',
> 'unchecked_tombstone_compaction': 'false'}
>     AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND crc_check_chance = 0.0
>     AND dclocal_read_repair_chance = 0.0
>     AND default_time_to_live = 1555200
>     AND gc_grace_seconds = 10800
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair_chance = 0.0
>     AND speculative_retry = '99PERCENTILE';
>
> Running this script:
>
> for f in *Data.db; do
>    ls -lrt $f
>    output=$(sstablemetadata $f 2>/dev/null)
>    max=$(echo "$output" | grep Maximum\ timestamp | cut -d" " -f3 | cut -c 1-10)
>    min=$(echo "$output" | grep Minimum\ timestamp | cut -d" " -f3 | cut -c 1-10)
>    date -d @$max +'%d/%m/%Y %H:%M:%S'
>    date -d @$min +'%d/%m/%Y %H:%M:%S'
> done
>
> on sstables I'm getting values like these:
>
> -rw-r--r-- 1 cassandra cassandra 12137573577 Jan 14 20:08
> mc-22750-big-Data.db
> 14/01/2018 19:57:41
> 31/12/2017 19:06:48
>
> -rw-r--r-- 1 cassandra cassandra 4669422106 Jan 14 06:55
> mc-22322-big-Data.db
> 12/01/2018 07:59:57
> 28/12/2017 19:08:42
>
> My goal is using TWCS for sstables expired fast because lots of
> new data
> is coming in. What is the best approach to archive that? Should I
> disable auto compaction?
> Thanks in advance.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> 
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 
>
> -- 
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com 



Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
Agree with you, Jon.

Actually, this cluster was configured by my 'predecessor' and [fortunately for 
him] we've never met :)

We're using version 2.1.15 and can't upgrade because of legacy Netflix Astyanax 
client used.


Below in the thread Alex mentioned that it's recommended to set vnodes to a 
value lower than 256 only for C* version > 3.0 (token allocation algorithm was 
improved since C* 3.0) .


Jon,

Do you have positive experience setting up  cluster with vnodes < 256 for  C* 
2.1?


vnodes=32 also too high, as for me (we need to have much more than 32 servers 
per AZ in order to get 'reliable' cluster)

vnodes=4 seems to be better from HA + balancing trade-off


Thanks,

Kyrill


From: Jon Haddad  on behalf of Jon Haddad 

Sent: Tuesday, January 16, 2018 6:44:53 PM
To: user
Subject: Re: vnodes: high availability

While all the token math is helpful, I have to also call out the elephant in 
the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you 
would be able to withstand the complete failure of an entire availability zone 
at QUORUM, or two if you queried at CL=ONE.

You are correct about 256 tokens causing issues, it’s one of the reasons why we 
recommend 32.  I’m curious how things behave going as low as 4, personally, but 
I haven’t done the math / tested it yet.



On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev 
> wrote:

...to me it sounds like 'C* isn't that highly-available by design as it's 
declared'.
More nodes in a cluster means higher probability of simultaneous node failures.
And from high-availability standpoint, looks like situation is made even worse 
by recommended setting vnodes=256.

Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.
In case smb from C* developers/architects is reading this message, I'd be 
grateful to get some links to calculations of C* reliability based on which 
decisions were made.

Regards,
Kyrill

From: kurt greaves >
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
> wrote:
Thanks Alexander!

I'm not a MS in math too) Unfortunately.

Not sure, but it seems to me that probability of 2/49 in your explanation 
doesn't take into account that vnodes endpoints are almost evenly distributed 
across all nodes (at least it's what I can see from "nodetool ring" output).

http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there no 2 nodes 
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).

That's because vnodes_per_node=8 > Nnodes=6.
As far as I understand, situation is getting worse with increase of 
vnodes_per_node/Nnode ratio.
Please, correct me if I'm wrong.

How would the situation differ from this example by DataStax, if we had a 
real-life 6-nodes cluster with 8 vnodes on each node?

Regards,
Kyrill


From: Alexander Dejanovski 
>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.
I officially suck at statistics, as expected :)

On Mon, Jan 15, 2018 at 17:55, Alexander Dejanovski 
> wrote:
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using 

Re: vnodes: high availability

2018-01-16 Thread Jon Haddad
While all the token math is helpful, I have to also call out the elephant in 
the room:

You have not correctly configured Cassandra for production.

If you had used the correct endpoint snitch & network topology strategy, you 
would be able to withstand the complete failure of an entire availability zone 
at QUORUM, or two if you queried at CL=ONE. 

You are correct about 256 tokens causing issues, it’s one of the reasons why we 
recommend 32.  I’m curious how things behave going as low as 4, personally, but 
I haven’t done the math / tested it yet.



> On Jan 16, 2018, at 2:02 AM, Kyrylo Lebediev  wrote:
> 
> ...to me it sounds like 'C* isn't that highly-available by design as it's 
> declared'.
> More nodes in a cluster means higher probability of simultaneous node 
> failures.
> And from high-availability standpoint, looks like situation is made even 
> worse by recommended setting vnodes=256.
> 
> Need to do some math to get numbers/formulas, but now situation doesn't seem 
> to be promising.
> In case smb from C* developers/architects is reading this message, I'd be 
> grateful to get some links to calculations of C* reliability based on which 
> decisions were made.  
> 
> Regards, 
> Kyrill
> From: kurt greaves 
> Sent: Tuesday, January 16, 2018 2:16:34 AM
> To: User
> Subject: Re: vnodes: high availability
>  
> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
> intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
> 
> If node A goes down all 256 ranges will go down, and considering there are 
> only 49 other nodes all with 256 vnodes each, it's very likely that every 
> node will be responsible for some range A was also responsible for. I'm not 
> sure what the exact math is, but think of it this way: If on each node, any 
> of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1 
> token ranges) on the ring with a token range on node A those token ranges 
> will be down at QUORUM. 
> 
> Because token range assignment just uses rand() under the hood, I'm sure you 
> could prove that it's always going to be the case that any 2 nodes going down 
> result in a loss of QUORUM for some token range.
> 
> On 15 January 2018 at 19:59, Kyrylo Lebediev  > wrote:
> Thanks Alexander!
> 
> I'm not a MS in math too) Unfortunately.
> 
> Not sure, but it seems to me that probability of 2/49 in your explanation 
> doesn't take into account that vnodes endpoints are almost evenly distributed 
> across all nodes (at least it's what I can see from "nodetool ring" output).
> 
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>  
> 
> of course this vnodes illustration is a theoretical one, but there no 2 nodes 
> on that diagram that can be switched off without losing a key range (at 
> CL=QUORUM). 
> 
> That's because vnodes_per_node=8 > Nnodes=6.
> As far as I understand, situation is getting worse with increase of 
> vnodes_per_node/Nnode ratio.
> Please, correct me if I'm wrong.
> 
> How would the situation differ from this example by DataStax, if we had a 
> real-life 6-nodes cluster with 8 vnodes on each node? 
> 
> Regards, 
> Kyrill
> 
> From: Alexander Dejanovski  >
> Sent: Monday, January 15, 2018 8:14:21 PM
> 
> To: user@cassandra.apache.org 
> Subject: Re: vnodes: high availability
>  
> I was corrected off list that the odds of losing data when 2 nodes are down 
> isn't dependent on the number of vnodes, but only on the number of nodes.
> The more vnodes, the smaller the chunks of data you may lose, and vice versa.
> I officially suck at statistics, as expected :)
> 
> On Mon, Jan 15, 2018 at 17:55, Alexander Dejanovski  > wrote:
> Hi Kyrylo,
> 
> the situation is a bit more nuanced than shown by the Datastax diagram, which 
> is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode 
> distribution is purely random, and the replica for a vnode will be placed on 
> the node that owns the next vnode in token order (yeah, that's not easy to 
> formulate), you end up with statistics only.
> 
> I kinda suck at maths but I'm going to risk making a fool of myself :)
> 
> The odds for one vnode to be replicated on another node are, in your case, 
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single node 
> to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node B 
> to have a vnode replicated on node A than the 

Slender Cassandra Cluster Project

2018-01-16 Thread Kenneth Brotman
I've begun working on a reference project intended to provide guidance on
configuring and operating a modest Cassandra cluster of about 18 nodes
suitable for the economic study, demonstration, experimentation and testing
of a Cassandra cluster.

 

The slender cluster would be designed to be as inexpensive as possible while
still using real world hardware in order to lower the cost to those with
limited initial resources. Sorry no Raspberry Pi's for this project.  

 

There would be an on-premises version and a cloud version.  Guidance would
be provided on configuring the cluster, on demonstrating key Cassandra
behaviors, on files sizes, capacity to use with the Slender Cassandra
Cluster, and so on.

 

Why about eighteen nodes? I tried to figure out the minimum number of
nodes needed for Cassandra to be Cassandra.  Here were my considerations:

 

. A user wouldn't run Cassandra in just one data center; so at
least two datacenters.

. A user probably would want a third data center available for
analytics.

. There needs to be enough nodes for enough parallelism to
observe Cassandra's distributed nature.

. The cluster should have enough nodes that one gets a sense of
the need for cluster wide management tools to do things like repairs,
snapshots and cluster monitoring.

. The cluster should be able to demonstrate a RF=3 with local
quorum.  If replicated in all three data centers, one write would impact
half the 18 nodes, 3 datacenters X 3 nodes per data center = 9 nodes of 18
nodes.  If replicated in two of the data centers, one write would still
impact one third of the 18 nodes, 2 DC's X 3 nodes per DC = 6 of the 18
nodes.  

 

So eighteen seems like the minimum number of nodes needed.  That's six nodes
in each of three data centers.
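
In CQL terms, the all-three-DCs case corresponds to something like this (DC
and keyspace names are placeholders):

    CREATE KEYSPACE slender_demo
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'dc1': 3, 'dc2': 3, 'dc3': 3};

    -- a LOCAL_QUORUM write waits on 2 of the 3 local replicas,
    -- but all 9 replicas (half of the 18 nodes) receive the write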

 

Before I get too carried away with this project, I'm looking for some
feedback on whether this project would indeed be helpful to others? Also,
should the project be changed in any way?

 

It's always a pleasure to connect with the Cassandra users' community.
Thanks for all the hard work, the expertise, the civil dialog.

 

Kenneth Brotman



Re: Upgrade to 3.11.1 give SSLv2Hello is disabled error

2018-01-16 Thread Michael Shuler
This looks like the post-POODLE commit:
https://issues.apache.org/jira/browse/CASSANDRA-10508

I think you might just set 'TLS' as in the example to use the JVM's
preferred TLS protocol version.
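
That is, something like this in cassandra.yaml, changing only the protocol
line from the config quoted below:

    server_encryption_options:
        internode_encryption: all
        keystore: /usr/share/cassandra/.ssl/server/keystore.jks
        keystore_password: 'x'
        truststore: /usr/share/cassandra/.ssl/server/truststore.jks
        truststore_password: 'x'
        protocol: TLS
        cipher_suites: [TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA]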

-- 
Michael

On 01/16/2018 08:13 AM, Tommy Stendahl wrote:
> Hi,
> 
> I have problems upgrading a cluster from 3.0.14 to 3.11.1 but when I
> upgrade the first node it fails to gossip.
> 
> I have server encryption enabled on all nodes with this setting:
> 
> server_encryption_options:
>     internode_encryption: all
>     keystore: /usr/share/cassandra/.ssl/server/keystore.jks
>     keystore_password: 'x'
>     truststore: /usr/share/cassandra/.ssl/server/truststore.jks
>     truststore_password: 'x'
>     protocol: TLSv1.2
>     cipher_suites:
> [TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA]
> 
> 
> I get this error in the log:
> 
> 2018-01-16T14:41:19.671+0100 ERROR [ACCEPT-/10.61.204.16]
> MessagingService.java:1329 SSL handshake error for inbound connection
> from 30f93bf4[SSL_NULL_WITH_NULL_NULL:
> Socket[addr=/x.x.x.x,port=40583,localport=7001]]
> javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled
>     at
> sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:637)
> ~[na:1.8.0_152]
>     at sun.security.ssl.InputRecord.read(InputRecord.java:527)
> ~[na:1.8.0_152]
>     at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983)
> ~[na:1.8.0_152]
>     at
> sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
> ~[na:1.8.0_152]
>     at
> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:938)
> ~[na:1.8.0_152]
>     at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
> ~[na:1.8.0_152]
>     at sun.security.ssl.AppInputStream.read(AppInputStream.java:71)
> ~[na:1.8.0_152]
>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
> ~[na:1.8.0_152]
>     at
> org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:1303)
> ~[apache-cassandra-3.11.1.jar:3.11.1]
> 
> I suspect that this has something to do with the change in
> CASSANDRA-10508. Any suggestions on how to get around this would be very
> much appreciated.
> 
> Thanks, /Tommy
> 
> 
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
> 


-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: New token allocation and adding a new DC

2018-01-16 Thread Oleksandr Shulgin
On Tue, Jan 16, 2018 at 4:16 PM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> Hi Oleksandr,
>
> if bootstrap is disabled, it will only skip the streaming phase but will
> still go through token allocation and thus should use the new algorithm.
> The algorithm won't try to spread data based on size on disk but it will
> try to spread token ownership as evenly as possible.
>
> The problem you'll run into is that ownership for a specific keyspace will
> be null as long as the replication strategy isn't updated to create
> replicas on the new DC.
> Quickly thinking would make me do the following :
>
>- Create enough nodes in the new DC to match the target replication
>factor
>- Alter the replication strategy to add the target number of replicas
>in the new DC (they will start getting writes, and hopefully you've already
>segregated reads)
>- Continue adding nodes in the new DC (with auto_bootstrap = false),
>specifying the right keyspace to optimize token allocations
>- Run rebuild on all nodes in the new DC
>
> I honestly never used it but that's my understanding of how it should work.
>

Oh, that's neat.  We will try this and see if it helps.

Thank you!
--
Alex


Re: Too many tombstones using TTL

2018-01-16 Thread Python_Max
Thanks for a very helpful reply.
Will try to refactor the code accordingly.

On Tue, Jan 16, 2018 at 4:36 PM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> I would not plan on deleting data at the row level as you'll end up with a
> lot of tombstones eventually (and you won't even notice them).
> It's not healthy to allow that many tombstones to be read, and while your
> latency may fit your SLA now, it may not in the future.
> Tombstones are going to create a lot of heap pressure and eventually
> trigger long GC pauses, which then tend to affect the whole cluster (a slow
> node is worse than a down node).
>
> You should definitely separate data that is TTLed and data that is not in
> different tables so that you can adjust compaction strategies,
> gc_grace_seconds and read patterns accordingly. I understand that it will
> complexify your code, but it will prevent severe performance issues in
> Cassandra.
>
> Tombstones won't be a problem for repair, they will get repaired as
> classic cells. They negatively affect the read path mostly, and use space
> on disk.
>
> On Tue, Jan 16, 2018 at 2:12 PM Python_Max  wrote:
>
>> Hello.
>>
>> I was planning to remove a row (not partition).
>>
>> Most of the tombstones are seen in the use case of geographic grid with
>> X:Y as partition key and object id (timeuuid) as clustering key where
>> objects could be temporary with TTL about 10 hours or fully persistent.
>> When I select all objects in specific X:Y I can even hit 100k (default)
>> limit for some X:Y. I have changed this limit to 500k since 99.9p read
>> latency is < 75ms so I should not (?) care how many tombstones while read
>> latency is fine.
>>
>> Splitting entities to temporary and permanent and using different
>> compaction strategies is an option but it will lead to code duplication and
>> 2x read queries.
>>
>> Is my assumption correct about tombstones are not so big problem as soon
>> as read latency and disk usage are okey? Are tombstones affect repair time
>> (using reaper)?
>>
>> Thanks.
>>
>>
>> On Tue, Jan 16, 2018 at 11:32 AM, Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi,
>>>
>>> could you be more specific about the deletes you're planning to perform ?
>>> This will end up moving your problem somewhere else as you'll be
>>> generating new tombstones (and if you're planning on deleting rows, be
>>> aware that row level tombstones aren't reported anywhere in the metrics,
>>> logs and query traces).
>>> Currently you can delete your data at the partition level, which will
>>> create a single tombstone that will shadow all your expired (and non
>>> expired) data and is very efficient. The read path is optimized for such
>>> tombstones and the data won't be fully read from disk nor exchanged between
>>> replicas. But that's of course if your use case allows to delete full
>>> partitions.
>>>
>>> We usually model so that we can restrict our reads to live data.
>>> If you're creating time series, your clustering key should include a
>>> timestamp, which you can use to avoid reading expired data. If your TTL is
>>> set to 60 days, you can read only data that is strictly younger than that.
>>> Then you can partition by time ranges, and access exclusively partitions
>>> that have no chance to be expired yet.
>>> Those techniques usually work better with TWCS, but the former could
>>> make you hit a lot of SSTables if your partitions can spread over all time
>>> buckets, so only use TWCS if you can restrict individual reads to up to 4
>>> time windows.
>>>
>>> Cheers,
>>>
>>>
>>> On Tue, Jan 16, 2018 at 10:01 AM Python_Max 
>>> wrote:
>>>
 Hi.

 Thank you very much for detailed explanation.
 Seems that there is nothing I can do about it except delete records by
 key instead of expiring.


 On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
 a...@thelastpickle.com> wrote:

> Hi,
>
> As DuyHai said, different TTLs could theoretically be set for
> different cells of the same row. And one TTLed cell could be shadowing
> another cell that has no TTL (say you forgot to set a TTL and set one
> afterwards by performing an update), or vice versa.
> One cell could also be missing from a node without Cassandra knowing.
> So turning an incomplete row that only has expired cells into a tombstone
> row could lead to wrong results being returned at read time : the 
> tombstone
> row could potentially shadow a valid live cell from another replica.
>
> Cassandra needs to retain each TTLed cell and send it to replicas
> during reads to cover all possible cases.
>
>
> On Fri, Jan 12, 2018 at 5:28 PM Python_Max 
> wrote:
>
>> Thank you for response.
>>
>> I know about the option of setting TTL per column or even per item in
>> collection. However in my example entire row has expired, 

Re: New token allocation and adding a new DC

2018-01-16 Thread Alexander Dejanovski
Hi Oleksandr,

if bootstrap is disabled, it will only skip the streaming phase but will
still go through token allocation and thus should use the new algorithm.
The algorithm won't try to spread data based on size on disk but it will
try to spread token ownership as evenly as possible.

The problem you'll run into is that ownership for a specific keyspace will
be null as long as the replication strategy isn't updated to create
replicas on the new DC.
Quickly thinking would make me do the following :

   - Create enough nodes in the new DC to match the target replication
   factor
   - Alter the replication strategy to add the target number of replicas in
   the new DC (they will start getting writes, and hopefully you've already
   segregated reads)
   - Continue adding nodes in the new DC (with auto_bootstrap = false),
   specifying the right keyspace to optimize token allocations
   - Run rebuild on all nodes in the new DC

I honestly never used it but that's my understanding of how it should work.
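
In terms of commands, the sequence would look roughly like this (keyspace name
is a placeholder, DC names as in the original mail below, RF values examples):

    # 1. start the first RF-worth of nodes in NEW_DC with, in cassandra.yaml:
    auto_bootstrap: false

    -- 2. then add the new DC to the replication strategy
    ALTER KEYSPACE my_keyspace
      WITH replication = {'class': 'NetworkTopologyStrategy',
                          'OLD_DC': 3, 'NEW_DC': 3};

    # 3. start the remaining NEW_DC nodes, also with auto_bootstrap: false, plus:
    allocate_tokens_for_keyspace: my_keyspace

    # 4. finally, on every node in NEW_DC:
    nodetool rebuild OLD_DC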

Cheers,


On Tue, Jan 16, 2018 at 3:51 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> Hello,
>
> We want to add a new rack to an existing cluster (a new Availability Zone
> on AWS).
>
> Currently we have 12 nodes in 2 racks with ~4 TB data per node.  We also
> want to have bigger number of smaller nodes.  In order to minimize the
> streaming we want to add a new DC which will span 3 racks and then
> decommission the old DC.
>
> Following the documented procedure we are going to create all nodes in the
> new DC with auto_bootstrap=false and a distinct dc_suffix.  Then we are
> going to run `nodetool rebuild OLD_DC` on every node.
>
> Since we are observing some uneven load distribution in the old DC, we
> wanted to make use of new token allocation algorithm of Cassandra 3.0+ when
> building the new DC.
>
> To our understanding, this is currently not supported, because the new
> algorithm can only be used during proper node bootstrap?
>
> In theory it should still be possible to allocate tokens in the new DC by
> telling Cassandra which keyspace to optimize for and from which remote DC
> the data will be streamed ultimately, or am I missing something?
>
> Reading through the original implementation ticket I didn't find any
> reference to interaction with rebuild:
> https://issues.apache.org/jira/browse/CASSANDRA-7032
> Nor do I find any open tickets that would discuss the topic.
>
> Is it reasonable to open an issue for that or is there some obvious
> blocker?
>
> Thanks,
> --
> Alex
>
>

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


New token allocation and adding a new DC

2018-01-16 Thread Oleksandr Shulgin
Hello,

We want to add a new rack to an existing cluster (a new Availability Zone
on AWS).

Currently we have 12 nodes in 2 racks with ~4 TB data per node.  We also
want to have bigger number of smaller nodes.  In order to minimize the
streaming we want to add a new DC which will span 3 racks and then
decommission the old DC.

Following the documented procedure we are going to create all nodes in the
new DC with auto_bootstrap=false and a distinct dc_suffix.  Then we are
going to run `nodetool rebuild OLD_DC` on every node.

Since we are observing some uneven load distribution in the old DC, we
wanted to make use of new token allocation algorithm of Cassandra 3.0+ when
building the new DC.

To our understanding, this is currently not supported, because the new
algorithm can only be used during proper node bootstrap?

In theory it should still be possible to allocate tokens in the new DC by
telling Cassandra which keyspace to optimize for and from which remote DC
the data will be streamed ultimately, or am I missing something?

Reading through the original implementation ticket I didn't find any
reference to interaction with rebuild:
https://issues.apache.org/jira/browse/CASSANDRA-7032
Nor do I find any open tickets that would discuss the topic.

Is it reasonable to open an issue for that or is there some obvious blocker?

Thanks,
-- 
Alex


Re: Too many tombstones using TTL

2018-01-16 Thread Alexander Dejanovski
I would not plan on deleting data at the row level as you'll end up with a
lot of tombstones eventually (and you won't even notice them).
It's not healthy to allow that many tombstones to be read, and while your
latency may fit your SLA now, it may not in the future.
Tombstones are going to create a lot of heap pressure and eventually
trigger long GC pauses, which then tend to affect the whole cluster (a slow
node is worse than a down node).

You should definitely separate data that is TTLed from data that is not,
in different tables, so that you can adjust compaction strategies,
gc_grace_seconds and read patterns accordingly. I understand that it will
complicate your code, but it will prevent severe performance issues in
Cassandra.

Tombstones won't be a problem for repair; they get repaired like regular
cells. They mostly affect the read path negatively, and use space on disk.
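
A minimal sketch of such a split, with a hypothetical keyspace/table and placeholder TTL, gc_grace_seconds and compaction window values to adjust to the actual workload:

# TTLed data gets its own table, tuned for expiration (TWCS, short gc_grace_seconds).
cqlsh <<'EOF'
CREATE TABLE IF NOT EXISTS my_ks.events_ttl (
    id   text,
    ts   timeuuid,
    data text,
    PRIMARY KEY (id, ts)
) WITH default_time_to_live = 36000          -- ~10 hours, adjust to your TTL
  AND gc_grace_seconds = 10800
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'HOURS',
                    'compaction_window_size': '1'};
EOF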

On Tue, Jan 16, 2018 at 2:12 PM Python_Max  wrote:

> Hello.
>
> I was planning to remove a row (not partition).
>
> Most of the tombstones are seen in the use case of a geographic grid with
> X:Y as the partition key and object id (timeuuid) as the clustering key, where
> objects can be temporary with a TTL of about 10 hours or fully persistent.
> When I select all objects in a specific X:Y I can even hit the 100k (default)
> limit for some X:Y. I have changed this limit to 500k since 99.9p read
> latency is < 75ms, so I should not (?) care how many tombstones there are
> while read latency is fine.
>
> Splitting entities into temporary and permanent ones and using different
> compaction strategies is an option, but it will lead to code duplication and
> 2x read queries.
>
> Is my assumption correct that tombstones are not such a big problem as long
> as read latency and disk usage are okay? Do tombstones affect repair time
> (using Reaper)?
>
> Thanks.
>
>
> On Tue, Jan 16, 2018 at 11:32 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi,
>>
>> could you be more specific about the deletes you're planning to perform ?
>> This will end up moving your problem somewhere else as you'll be
>> generating new tombstones (and if you're planning on deleting rows, be
>> aware that row level tombstones aren't reported anywhere in the metrics,
>> logs and query traces).
>> Currently you can delete your data at the partition level, which will
>> create a single tombstone that will shadow all your expired (and non
>> expired) data and is very efficient. The read path is optimized for such
>> tombstones and the data won't be fully read from disk nor exchanged between
>> replicas. But that's of course if your use case allows to delete full
>> partitions.
>>
>> We usually model so that we can restrict our reads to live data.
>> If you're creating time series, your clustering key should include a
>> timestamp, which you can use to avoid reading expired data. If your TTL is
>> set to 60 days, you can read only data that is strictly younger than that.
>> Then you can partition by time ranges, and access exclusively partitions
>> that have no chance to be expired yet.
>> Those techniques usually work better with TWCS, but the former could make
>> you hit a lot of SSTables if your partitions can spread over all time
>> buckets, so only use TWCS if you can restrict individual reads to up to 4
>> time windows.
>>
>> Cheers,
>>
>>
>> On Tue, Jan 16, 2018 at 10:01 AM Python_Max  wrote:
>>
>>> Hi.
>>>
>>> Thank you very much for detailed explanation.
>>> Seems that there is nothing I can do about it except delete records by
>>> key instead of expiring.
>>>
>>>
>>> On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
>>> a...@thelastpickle.com> wrote:
>>>
 Hi,

 As DuyHai said, different TTLs could theoretically be set for different
 cells of the same row. And one TTLed cell could be shadowing another cell
 that has no TTL (say you forgot to set a TTL and set one afterwards by
 performing an update), or vice versa.
 One cell could also be missing from a node without Cassandra knowing.
 So turning an incomplete row that only has expired cells into a tombstone
 row could lead to wrong results being returned at read time : the tombstone
 row could potentially shadow a valid live cell from another replica.

 Cassandra needs to retain each TTLed cell and send it to replicas
 during reads to cover all possible cases.


 On Fri, Jan 12, 2018 at 5:28 PM Python_Max 
 wrote:

> Thank you for response.
>
> I know about the option of setting TTL per column or even per item in
> collection. However, in my example the entire row has expired; shouldn't
> Cassandra be able to detect this situation and create a single tombstone
> for the entire row instead of many?
> Is there any reason for not doing this other than that no one needs it? Is
> this suitable for a feature request or improvement?
>
> Thanks.
>
> On Wed, Jan 

Upgrade to 3.11.1 give SSLv2Hello is disabled error

2018-01-16 Thread Tommy Stendahl

Hi,

I have problems upgrading a cluster from 3.0.14 to 3.11.1: when I 
upgrade the first node it fails to gossip.


I have server encryption enabled on all nodes with this setting:

server_encryption_options:
    internode_encryption: all
    keystore: /usr/share/cassandra/.ssl/server/keystore.jks
    keystore_password: 'x'
    truststore: /usr/share/cassandra/.ssl/server/truststore.jks
    truststore_password: 'x'
    protocol: TLSv1.2
    cipher_suites: 
[TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_RSA_WITH_AES_128_CBC_SHA]


I get this error in the log:

2018-01-16T14:41:19.671+0100 ERROR [ACCEPT-/10.61.204.16] 
MessagingService.java:1329 SSL handshake error for inbound connection 
from 30f93bf4[SSL_NULL_WITH_NULL_NULL: 
Socket[addr=/x.x.x.x,port=40583,localport=7001]]

javax.net.ssl.SSLHandshakeException: SSLv2Hello is disabled
    at 
sun.security.ssl.InputRecord.handleUnknownRecord(InputRecord.java:637) 
~[na:1.8.0_152]
    at sun.security.ssl.InputRecord.read(InputRecord.java:527) 
~[na:1.8.0_152]
    at 
sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:983) 
~[na:1.8.0_152]
    at 
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385) 
~[na:1.8.0_152]
    at 
sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:938) 
~[na:1.8.0_152]
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105) 
~[na:1.8.0_152]
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:71) 
~[na:1.8.0_152]
    at java.io.DataInputStream.readInt(DataInputStream.java:387) 
~[na:1.8.0_152]
    at 
org.apache.cassandra.net.MessagingService$SocketThread.run(MessagingService.java:1303) 
~[apache-cassandra-3.11.1.jar:3.11.1]


I suspect that this has something to do with the change in 
CASSANDRA-10508. Any suggestions on how to get around this would be very 
much appreciated.


Thanks, /Tommy
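
Not a fix, but a quick way to see how the upgraded node behaves on the storage port; the address and port below are placeholders taken from the log line above:

# Probe the internode SSL port of the upgraded 3.11.1 node with a plain TLSv1.2 hello.
openssl s_client -connect 10.61.204.16:7001 -tls1_2 < /dev/null
# If this handshake gets past the hello, the node accepts TLSv1.2 itself, and the
# error above is likely about the SSLv2Hello-formatted hello sent by the 3.0.14 peers.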






Re: Too many tombstones using TTL

2018-01-16 Thread Python_Max
Hello.

I was planning to remove a row (not partition).

Most of the tombstones are seen in the use case of a geographic grid with X:Y
as the partition key and object id (timeuuid) as the clustering key, where
objects can be temporary with a TTL of about 10 hours or fully persistent.
When I select all objects in a specific X:Y I can even hit the 100k (default)
limit for some X:Y. I have changed this limit to 500k since 99.9p read
latency is < 75ms, so I should not (?) care how many tombstones there are
while read latency is fine.

Splitting entities into temporary and permanent ones and using different
compaction strategies is an option, but it will lead to code duplication and
2x read queries.

Is my assumption correct that tombstones are not such a big problem as long as
read latency and disk usage are okay? Do tombstones affect repair time
(using Reaper)?

Thanks.
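
For reference, the 100k default limit mentioned here sounds like cassandra.yaml's tombstone_failure_threshold (an assumption on my part; its warning counterpart is tombstone_warn_threshold). A quick way to check what a node is actually running with, assuming the usual package config path:

# Defaults are tombstone_warn_threshold: 1000 and tombstone_failure_threshold: 100000.
grep -E 'tombstone_(warn|failure)_threshold' /etc/cassandra/cassandra.yaml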


On Tue, Jan 16, 2018 at 11:32 AM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> Hi,
>
> could you be more specific about the deletes you're planning to perform ?
> This will end up moving your problem somewhere else as you'll be
> generating new tombstones (and if you're planning on deleting rows, be
> aware that row level tombstones aren't reported anywhere in the metrics,
> logs and query traces).
> Currently you can delete your data at the partition level, which will
> create a single tombstone that will shadow all your expired (and non
> expired) data and is very efficient. The read path is optimized for such
> tombstones and the data won't be fully read from disk nor exchanged between
> replicas. But that's of course if your use case allows to delete full
> partitions.
>
> We usually model so that we can restrict our reads to live data.
> If you're creating time series, your clustering key should include a
> timestamp, which you can use to avoid reading expired data. If your TTL is
> set to 60 days, you can read only data that is strictly younger than that.
> Then you can partition by time ranges, and access exclusively partitions
> that have no chance to be expired yet.
> Those techniques usually work better with TWCS, but the former could make
> you hit a lot of SSTables if your partitions can spread over all time
> buckets, so only use TWCS if you can restrict individual reads to up to 4
> time windows.
>
> Cheers,
>
>
> On Tue, Jan 16, 2018 at 10:01 AM Python_Max  wrote:
>
>> Hi.
>>
>> Thank you very much for detailed explanation.
>> Seems that there is nothing I can do about it except delete records by
>> key instead of expiring.
>>
>>
>> On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi,
>>>
>>> As DuyHai said, different TTLs could theoretically be set for different
>>> cells of the same row. And one TTLed cell could be shadowing another cell
>>> that has no TTL (say you forgot to set a TTL and set one afterwards by
>>> performing an update), or vice versa.
>>> One cell could also be missing from a node without Cassandra knowing. So
>>> turning an incomplete row that only has expired cells into a tombstone row
>>> could lead to wrong results being returned at read time : the tombstone row
>>> could potentially shadow a valid live cell from another replica.
>>>
>>> Cassandra needs to retain each TTLed cell and send it to replicas during
>>> reads to cover all possible cases.
>>>
>>>
>>> On Fri, Jan 12, 2018 at 5:28 PM Python_Max  wrote:
>>>
 Thank you for response.

 I know about the option of setting TTL per column or even per item in
 collection. However, in my example the entire row has expired; shouldn't
 Cassandra be able to detect this situation and create a single tombstone for
 the entire row instead of many?
 Is there any reason for not doing this other than that no one needs it? Is this
 suitable for a feature request or improvement?

 Thanks.

 On Wed, Jan 10, 2018 at 4:52 PM, DuyHai Doan 
 wrote:

> "The question is why Cassandra creates a tombstone for every column
> instead of single tombstone per row?"
>
> --> Simply because technically it is possible to set different TTL
> value on each column of a CQL row
>
> On Wed, Jan 10, 2018 at 2:59 PM, Python_Max 
> wrote:
>
>> Hello, C* users and experts.
>>
>> I have (one more) question about tombstones.
>>
>> Consider the following example:
>> cqlsh> create keyspace test_ttl with replication = {'class':
>> 'SimpleStrategy', 'replication_factor': '1'}; use test_ttl;
>> cqlsh> create table items(a text, b text, c1 text, c2 text, c3 text,
>> primary key (a, b));
>> cqlsh> insert into items(a,b,c1,c2,c3) values('AAA', 'BBB', 'C111',
>> 'C222', 'C333') using ttl 60;
>> bash$ nodetool flush
>> bash$ sleep 60
>> bash$ nodetool compact test_ttl items
>> bash$ sstabledump mc-2-big-Data.db
>>
>> [
>>   {
>> "partition" : {

Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
Alex, thanks for detailed explanation.


> The more nodes you have, the smaller will be the subset of data that cannot 
> achieve quorum (so your outage is not as bad as when you have a small number 
> of nodes

Okay, let's say we lost 0.5% of the key range [for the specified CL]. Critically 
important data chunks may fall into this range.


>If you want more availability but don't want to sacrifice consistency, you can 
>raise your replication factor (if you can afford the extra disk space usage)


IMO, ~1.6x more disk space (5/3) isn't the most important drawback in this 
case. For example, by increasing RF we could affect latency for queries with 
CL=QUORUM (we'd have to wait for responses from 3 nodes at RF=5 instead of 
2 nodes at RF=3). Plus, the overhead of most maintenance operations like 
anti-entropy repairs should grow as RF increases.


>Datacenter and rack awareness built in Cassandra can help with availability 
>guarantees : 1 full rack down out of 3 will still allow QUORUM at RF=3

At the same time, if we get any 2 nodes from different racks down at the same 
time (in case of high vnode counts), a key range becomes unavailable, as far as I 
understand. [I stick to CL=QUORUM just for brevity's sake. For CL=ONE the issue 
is similar.]


[In the worst-case scenario of any C* setup (with or without vnodes), if we 
lose two 'neighboring' nodes, we get an outage for a key range at CL=QUORUM: 
http://thelastpickle.com/blog/2011/06/13/Down-For-Me.html ]


--

I'm not trying to criticize the C* architecture, which gives us good features like 
linear scalability, but I feel that some math should be done in order to 
elaborate setup best practices on how to maximize availability for C* clusters 
(there are a number of best practices, including those mentioned by Alex, but 
personally I sometimes can't see the math behind them).


If anybody on the DL has some math prepared, please share it with us. I guess 
I'm not the only one interested in getting these valuable formulas/graphs.


Thanks,

Kyrill





From: Alexander Dejanovski 
Sent: Tuesday, January 16, 2018 12:50:13 PM
To: user@cassandra.apache.org
Subject: Re: vnodes: high availability

Hi Kyrylo,

high availability can be interpreted in many ways, and comes with some 
tradeoffs with consistency when things go wrong.
A few considerations here :

  *   The more nodes you have, the smaller will be the subset of data that 
cannot achieve quorum (so your outage is not as bad as when you have a small 
number of nodes)
  *   If you want more availability but don't want to sacrifice consistency, 
you can raise your replication factor (if you can afford the extra disk space 
usage)
  *   Datacenter and rack awareness built in Cassandra can help with 
availability guarantees : 1 full rack down out of 3 will still allow QUORUM at 
RF=3 and 2 racks down out of 5 at RF=5. Having one datacenter down (when using 
LOCAL_QUORUM) allows you to switch to another one and still have a working 
cluster.
  *   As mentioned in this thread, you can use downgrading retry policies to 
improve availability at the transient expense of consistency (check if your use 
case allows it)

Now about vnodes, the recommendation of using 256 is based on statistical 
analysis of data balance across clusters. Since the token allocation is fully 
random, it's been observed that 256 vnodes always gave a good balance.
If you're using a version of Cassandra >= 3.0, you can lower that to either 16 
or 32 and use the new token allocation algorithm. It will 
perform several attempts in order to balance a specific keyspace during 
bootstrap.
Using smaller numbers of vnodes will also improve repair time.
I won't go into statistics again (yikes) and leave it to people that are better 
at doing maths on how the number of vnodes per node could affect availability.

That brings us to the fact that you can fully disable vnodes and use a single 
token per node. In that case, you can be sure which nodes are replicas of the 
same tokens as it follows the ring order : With RF=3, node A tokens are 
replicated on nodes B and C, and node B tokens are replicated on nodes C and D, 
and so on.
You get more predictability as to which nodes can be taken down at the same 
time without losing QUORUM.
But you must afford the operational burden of handling tokens manually, and 
accept that growing the cluster means doubling the size each time.

The thing to consider is how your apps/services will react in case of transient 
loss of QUORUM : can you afford eventual 

Re: TWCS and autocompaction

2018-01-16 Thread Alexander Dejanovski
Hi,

The overlaps you're seeing across time windows aren't due to automatic
compactions, but to read repairs.
You must be reading at QUORUM or LOCAL_QUORUM, which can perform a foreground
read repair in case of a digest mismatch.

You can set unchecked_tombstone_compaction to true if you want to perform
single-SSTable compactions to purge tombstones, and a patch has recently been
merged to allow TWCS to delete fully expired data even in case of
overlap between time windows (I can't remember if it made it into
3.11.1).
Just so you know, the timestamp considered for time windows is the max
timestamp: you can have old data in recent time windows, but not the
opposite.

Cheers,
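
A sketch of the compaction change suggested above, keeping the options from the original message; the keyspace/table names are placeholders:

# Allow single-SSTable compactions to purge tombstones on the existing TWCS table.
cqlsh <<'EOF'
ALTER TABLE my_ks.my_table
WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                   'compaction_window_unit': 'HOURS',
                   'compaction_window_size': '4',
                   'unchecked_tombstone_compaction': 'true'};
EOF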

On Tue, Jan 16, 2018 at 12:07, Cogumelos Maravilha <
cogumelosmaravi...@sapo.pt> wrote:

> Hi list,
>
> My settings:
>
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
> 'enabled': 'true', 'max_threshold': '64', 'min_threshold': '2',
> 'tombstone_compaction_interval': '15000', 'tombstone_threshold': '0.2',
> 'unchecked_tombstone_compaction': 'false'}
> AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 0.0
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 1555200
> AND gc_grace_seconds = 10800
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
>
> Running this script:
>
> for f in *Data.db; do
>ls -lrt $f
>output=$(sstablemetadata $f 2>/dev/null)
>max=$(echo "$output" | grep Maximum\ timestamp | cut -d" " -f3 | cut
> -c 1-10)
>min=$(echo "$output" | grep Minimum\ timestamp | cut -d" " -f3 | cut
> -c 1-10)
>date -d @$max +'%d/%m/%Y %H:%M:%S'
>date -d @$min +'%d/%m/%Y %H:%M:%S'
> done
>
> on sstables I'm getting values like these:
>
> -rw-r--r-- 1 cassandra cassandra 12137573577 Jan 14 20:08
> mc-22750-big-Data.db
> 14/01/2018 19:57:41
> 31/12/2017 19:06:48
>
> -rw-r--r-- 1 cassandra cassandra 4669422106 Jan 14 06:55
> mc-22322-big-Data.db
> 12/01/2018 07:59:57
> 28/12/2017 19:08:42
>
> My goal is to use TWCS so that SSTables expire quickly, because lots of new
> data is coming in. What is the best approach to achieve that? Should I
> disable auto compaction?
> Thanks in advance.
>
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


TWCS and autocompaction

2018-01-16 Thread Cogumelos Maravilha
Hi list,

My settings:

AND compaction = {'class':
'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
'enabled': 'true', 'max_threshold': '64', 'min_threshold': '2',
'tombstone_compaction_interval': '15000', 'tombstone_threshold': '0.2',
'unchecked_tombstone_compaction': 'false'}
    AND compression = {'chunk_length_in_kb': '64', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 0.0
    AND dclocal_read_repair_chance = 0.0
    AND default_time_to_live = 1555200
    AND gc_grace_seconds = 10800
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

Running this script:

for f in *Data.db; do
   ls -lrt $f
   output=$(sstablemetadata $f 2>/dev/null)
   max=$(echo "$output" | grep Maximum\ timestamp | cut -d" " -f3 | cut
-c 1-10)
   min=$(echo "$output" | grep Minimum\ timestamp | cut -d" " -f3 | cut
-c 1-10)
   date -d @$max +'%d/%m/%Y %H:%M:%S'
   date -d @$min +'%d/%m/%Y %H:%M:%S'
done

on sstables I'm getting values like these:

-rw-r--r-- 1 cassandra cassandra 12137573577 Jan 14 20:08
mc-22750-big-Data.db
14/01/2018 19:57:41
31/12/2017 19:06:48

-rw-r--r-- 1 cassandra cassandra 4669422106 Jan 14 06:55
mc-22322-big-Data.db
12/01/2018 07:59:57
28/12/2017 19:08:42

My goal is to use TWCS so that SSTables expire quickly, because lots of new
data is coming in. What is the best approach to achieve that? Should I
disable auto compaction?
Thanks in advance.





Re: vnodes: high availability

2018-01-16 Thread Alexander Dejanovski
Hi Kyrylo,

high availability can be interpreted in many ways, and comes with some
tradeoffs with consistency when things go wrong.
A few considerations here :

   - The more nodes you have, the smaller will be the subset of data that
   cannot achieve quorum (so your outage is not as bad as when you have a
   small number of nodes)
   - If you want more availability but don't want to sacrifice consistency,
   you can raise your replication factor (if you can afford the extra disk
   space usage)
   - Datacenter and rack awareness built in Cassandra can help with
   availability guarantees : 1 full rack down out of 3 will still allow QUORUM
   at RF=3 and 2 racks down out of 5 at RF=5. Having one datacenter down (when
   using LOCAL_QUORUM) allows you to switch to another one and still have a
   working cluster.
   - As mentioned in this thread, you can use downgrading retry policies to
   improve availability at the transient expense of consistency (check if your
   use case allows it)

Now about vnodes, the recommendation of using 256 is based on statistical
analysis of data balance across clusters. Since the token allocation is
fully random, it's been observed that 256 vnodes always gave a good balance.
If you're using a version of Cassandra >= 3.0, you can lower that to either
16 or 32 and use the new token allocation algorithm.
It will perform several attempts in order to balance a specific keyspace
during bootstrap.
Using smaller numbers of vnodes will also improve repair time.
I won't go into statistics again (yikes) and leave it to people that are
better at doing maths on how the number of vnodes per node could affect
availability.

That brings us to the fact that you can fully disable vnodes and use a
single token per node. In that case, you can be sure which nodes are
replicas of the same tokens as it follows the ring order : With RF=3, node
A tokens are replicated on nodes B and C, and node B tokens are replicated
on nodes C and D, and so on.
You get more predictability as to which nodes can be taken down at the same
time without losing QUORUM.
But you must afford the operational burden of handling tokens manually, and
accept that growing the cluster means doubling the size each time.

The thing to consider is how your apps/services will react in case of
transient loss of QUORUM : can you afford eventual consistency ? Is it
better to endure full downtime or just on a subset of your partitions ?
And can you design your cluster with racks/datacenters so that you can
better predict how to run maintenance operations or if you may be losing
QUORUM ?

The way Cassandra is designed also allows linear scalability, which
master/slave based databases cannot handle (and master/slave architectures
come with their set of challenges, especially during network partitions).

So, while the high availability isn't as transparent as one might think
(and I understand why you may be disappointed), you have a lot of options
on how to react to partial downtime, and that's something you must consider
both when designing your cluster (how it is segmented, how operations are
performed), and when designing your apps (how you will use the driver, how
your apps will react to failure).

Cheers,
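
A small sketch of how to check replica placement directly instead of reasoning about it in the abstract; the keyspace, table and key below are placeholders:

# Which nodes hold the replicas for one specific partition key:
nodetool getendpoints my_ks my_table some_partition_key

# Token range ownership, one entry per vnode (so num_tokens lines per node):
nodetool ring my_ks | head -40
nodetool describering my_ks | head -40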


On Tue, Jan 16, 2018 at 11:03 AM Kyrylo Lebediev 
wrote:

> ...to me it sounds like 'C* isn't as highly available by design as it's
> declared to be'.
>
> More nodes in a cluster means higher probability of simultaneous node
> failures.
>
> And from a high-availability standpoint, it looks like the situation is made
> even worse by the recommended setting of vnodes=256.
>
>
> Need to do some math to get numbers/formulas, but now situation doesn't
> seem to be promising.
>
> In case somebody from the C* developers/architects is reading this message,
> I'd be grateful for some links to the calculations of C* reliability on which
> these decisions were based.
>
>
> Regards,
>
> Kyrill
> --
> *From:* kurt greaves 
> *Sent:* Tuesday, January 16, 2018 2:16:34 AM
> *To:* User
>
> *Subject:* Re: vnodes: high availability
> Yeah it's very unlikely that you will have 2 nodes in the cluster with NO
> intersecting token ranges (vnodes) for an RF of 3 (probably even 2).
>
> If node A goes down all 256 ranges will go down, and considering there are
> only 49 other nodes all with 256 vnodes each, it's very likely that every
> node will be responsible for some range A was also responsible for. I'm not
> sure what the exact math is, but think of it this way: If on each node, any
> of its 256 token ranges overlap (it's within the next RF-1 or previous RF-1
> token ranges) on the ring with a token range on node A, those token ranges
> will be down at QUORUM.
>
> Because token range assignment just uses rand() under the hood, I'm sure
> you could prove that it's always going to be the case that any 2 nodes
> going down result in a loss of QUORUM for some token range.
>
> On 15 January 2018 at 

Re: vnodes: high availability

2018-01-16 Thread Kyrylo Lebediev
...to me it sounds like 'C* isn't as highly available by design as it's 
declared to be'.

More nodes in a cluster means higher probability of simultaneous node failures.

And from a high-availability standpoint, it looks like the situation is made even 
worse by the recommended setting of vnodes=256.


Need to do some math to get numbers/formulas, but now situation doesn't seem to 
be promising.

In case somebody from the C* developers/architects is reading this message, I'd be 
grateful for some links to the calculations of C* reliability on which these 
decisions were based.


Regards,

Kyrill


From: kurt greaves 
Sent: Tuesday, January 16, 2018 2:16:34 AM
To: User
Subject: Re: vnodes: high availability

Yeah it's very unlikely that you will have 2 nodes in the cluster with NO 
intersecting token ranges (vnodes) for an RF of 3 (probably even 2).

If node A goes down all 256 ranges will go down, and considering there are only 
49 other nodes all with 256 vnodes each, it's very likely that every node will 
be responsible for some range A was also responsible for. I'm not sure what the 
exact math is, but think of it this way: If on each node, any of its 256 token 
ranges overlap (it's within the next RF-1 or previous RF-1 token ranges) on the 
ring with a token range on node A, those token ranges will be down at QUORUM.

Because token range assignment just uses rand() under the hood, I'm sure you 
could prove that it's always going to be the case that any 2 nodes going down 
result in a loss of QUORUM for some token range.

On 15 January 2018 at 19:59, Kyrylo Lebediev 
> wrote:

Thanks Alexander!


I don't have an MS in math either :) Unfortunately.


Not sure, but it seems to me that the 2/49 probability in your explanation 
doesn't take into account that vnode endpoints are almost evenly distributed 
across all nodes (at least that's what I can see from "nodetool ring" output).


http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
of course this vnodes illustration is a theoretical one, but there are no 2 nodes 
on that diagram that can be switched off without losing a key range (at 
CL=QUORUM).


That's because vnodes_per_node=8 > Nnodes=6.

As far as I understand, the situation gets worse as the 
vnodes_per_node/Nnodes ratio increases.

Please, correct me if I'm wrong.


How would the situation differ from this example by DataStax if we had a 
real-life 6-node cluster with 8 vnodes on each node?


Regards,

Kyrill



From: Alexander Dejanovski 
>
Sent: Monday, January 15, 2018 8:14:21 PM

To: user@cassandra.apache.org
Subject: Re: vnodes: high availability


I was corrected off list that the odds of losing data when 2 nodes are down 
isn't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice versa.

I officially suck at statistics, as expected :)

On Mon, Jan 15, 2018 at 17:55, Alexander Dejanovski 
> wrote:
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram, which 
is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode 
distribution is purely random, and the replica for a vnode will be placed on 
the node that owns the next vnode in token order (yeah, that's not easy to 
formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case, 2/49 
(out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node to 
exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node B to 
have a vnode replicated on node A than the opposite), that doubles the odds of 
2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having more 
nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in that 
area...)

How many queries that will affect is a different question as it depends on 
which partition currently exist and are queried in the unavailable token ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of 
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in a 
single rack, you will still have two other replicas available to satisfy quorum 
in the remaining racks.
If you start losing nodes in different racks, we're back 

Re: Too many tombstones using TTL

2018-01-16 Thread Alexander Dejanovski
Hi,

could you be more specific about the deletes you're planning to perform ?
This will end up moving your problem somewhere else as you'll be generating
new tombstones (and if you're planning on deleting rows, be aware that row
level tombstones aren't reported anywhere in the metrics, logs and query
traces).
Currently you can delete your data at the partition level, which will
create a single tombstone that will shadow all your expired (and non
expired) data and is very efficient. The read path is optimized for such
tombstones and the data won't be fully read from disk nor exchanged between
replicas. But that's of course if your use case allows to delete full
partitions.

We usually model so that we can restrict our reads to live data.
If you're creating time series, your clustering key should include a
timestamp, which you can use to avoid reading expired data. If your TTL is
set to 60 days, you can read only data that is strictly younger than that.
Then you can partition by time ranges, and access exclusively partitions
that have no chance to be expired yet.
Those techniques usually work better with TWCS, but the former could make
you hit a lot of SSTables if your partitions can spread over all time
buckets, so only use TWCS if you can restrict individual reads to up to 4
time windows.

Cheers,
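
A minimal sketch of the time-bucketed modelling described above, with hypothetical table and column names; the time bucket in the partition key lets you both restrict reads to data that cannot have expired yet and drop a whole old bucket with a single partition-level tombstone:

cqlsh <<'EOF'
CREATE TABLE IF NOT EXISTS my_ks.events_by_day (
    sensor_id text,
    day       date,        -- time bucket, part of the partition key
    ts        timestamp,
    value     text,
    PRIMARY KEY ((sensor_id, day), ts)
);

-- Read only data that cannot have expired yet (assuming the 60-day TTL above):
SELECT * FROM my_ks.events_by_day
 WHERE sensor_id = 'sensor-1' AND day = '2018-01-16'
   AND ts > '2017-11-18 00:00:00';

-- Drop an entire expired bucket with one partition-level tombstone:
DELETE FROM my_ks.events_by_day
 WHERE sensor_id = 'sensor-1' AND day = '2017-10-01';
EOF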


On Tue, Jan 16, 2018 at 10:01 AM Python_Max  wrote:

> Hi.
>
> Thank you very much for detailed explanation.
> Seems that there is nothing I can do about it except delete records by key
> instead of expiring.
>
>
> On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi,
>>
>> As DuyHai said, different TTLs could theoretically be set for different
>> cells of the same row. And one TTLed cell could be shadowing another cell
>> that has no TTL (say you forgot to set a TTL and set one afterwards by
>> performing an update), or vice versa.
>> One cell could also be missing from a node without Cassandra knowing. So
>> turning an incomplete row that only has expired cells into a tombstone row
>> could lead to wrong results being returned at read time : the tombstone row
>> could potentially shadow a valid live cell from another replica.
>>
>> Cassandra needs to retain each TTLed cell and send it to replicas during
>> reads to cover all possible cases.
>>
>>
>> On Fri, Jan 12, 2018 at 5:28 PM Python_Max  wrote:
>>
>>> Thank you for response.
>>>
>>> I know about the option of setting TTL per column or even per item in
>>> collection. However, in my example the entire row has expired; shouldn't
>>> Cassandra be able to detect this situation and create a single tombstone for
>>> the entire row instead of many?
>>> Is there any reason for not doing this other than that no one needs it? Is this
>>> suitable for a feature request or improvement?
>>>
>>> Thanks.
>>>
>>> On Wed, Jan 10, 2018 at 4:52 PM, DuyHai Doan 
>>> wrote:
>>>
 "The question is why Cassandra creates a tombstone for every column
 instead of single tombstone per row?"

 --> Simply because technically it is possible to set different TTL
 value on each column of a CQL row

 On Wed, Jan 10, 2018 at 2:59 PM, Python_Max 
 wrote:

> Hello, C* users and experts.
>
> I have (one more) question about tombstones.
>
> Consider the following example:
> cqlsh> create keyspace test_ttl with replication = {'class':
> 'SimpleStrategy', 'replication_factor': '1'}; use test_ttl;
> cqlsh> create table items(a text, b text, c1 text, c2 text, c3 text,
> primary key (a, b));
> cqlsh> insert into items(a,b,c1,c2,c3) values('AAA', 'BBB', 'C111',
> 'C222', 'C333') using ttl 60;
> bash$ nodetool flush
> bash$ sleep 60
> bash$ nodetool compact test_ttl items
> bash$ sstabledump mc-2-big-Data.db
>
> [
>   {
> "partition" : {
>   "key" : [ "AAA" ],
>   "position" : 0
> },
> "rows" : [
>   {
> "type" : "row",
> "position" : 58,
> "clustering" : [ "BBB" ],
> "liveness_info" : { "tstamp" : "2018-01-10T13:29:25.777Z",
> "ttl" : 60, "expires_at" : "2018-01-10T13:30:25Z", "expired" : true },
> "cells" : [
>   { "name" : "c1", "deletion_info" : { "local_delete_time" :
> "2018-01-10T13:29:25Z" }
>   },
>   { "name" : "c2", "deletion_info" : { "local_delete_time" :
> "2018-01-10T13:29:25Z" }
>   },
>   { "name" : "c3", "deletion_info" : { "local_delete_time" :
> "2018-01-10T13:29:25Z" }
>   }
> ]
>   }
> ]
>   }
> ]
>
> The question is why Cassandra creates a tombstone for every column
> instead of single tombstone per row?
>
> In production environment I have a table with ~30 columns and It gives
> me a warning for 30k 

Re: Too many tombstones using TTL

2018-01-16 Thread Python_Max
Hi.

Thank you very much for detailed explanation.
Seems that there is nothing I can do about it except delete records by key
instead of expiring.
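
For what it's worth, a sketch of what "delete by key" looks like against the test_ttl.items table from the example quoted below; an explicit row delete produces a single row-level tombstone instead of one tombstone cell per column (the sstable file name is just whatever the flush happens to produce):

# Delete the whole CQL row instead of letting it expire via TTL.
cqlsh -e "DELETE FROM test_ttl.items WHERE a = 'AAA' AND b = 'BBB';"
nodetool flush test_ttl items
# sstabledump on the newly flushed sstable should show a single row-level
# deletion_info for clustering ["BBB"] rather than per-column tombstones.
sstabledump mc-3-big-Data.db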


On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> Hi,
>
> As DuyHai said, different TTLs could theoretically be set for different
> cells of the same row. And one TTLed cell could be shadowing another cell
> that has no TTL (say you forgot to set a TTL and set one afterwards by
> performing an update), or vice versa.
> One cell could also be missing from a node without Cassandra knowing. So
> turning an incomplete row that only has expired cells into a tombstone row
> could lead to wrong results being returned at read time : the tombstone row
> could potentially shadow a valid live cell from another replica.
>
> Cassandra needs to retain each TTLed cell and send it to replicas during
> reads to cover all possible cases.
>
>
> On Fri, Jan 12, 2018 at 5:28 PM Python_Max  wrote:
>
>> Thank you for response.
>>
>> I know about the option of setting TTL per column or even per item in
>> collection. However, in my example the entire row has expired; shouldn't
>> Cassandra be able to detect this situation and create a single tombstone for
>> the entire row instead of many?
>> Is there any reason for not doing this other than that no one needs it? Is this
>> suitable for a feature request or improvement?
>>
>> Thanks.
>>
>> On Wed, Jan 10, 2018 at 4:52 PM, DuyHai Doan 
>> wrote:
>>
>>> "The question is why Cassandra creates a tombstone for every column
>>> instead of single tombstone per row?"
>>>
>>> --> Simply because technically it is possible to set different TTL value
>>> on each column of a CQL row
>>>
>>> On Wed, Jan 10, 2018 at 2:59 PM, Python_Max 
>>> wrote:
>>>
 Hello, C* users and experts.

 I have (one more) question about tombstones.

 Consider the following example:
 cqlsh> create keyspace test_ttl with replication = {'class':
 'SimpleStrategy', 'replication_factor': '1'}; use test_ttl;
 cqlsh> create table items(a text, b text, c1 text, c2 text, c3 text,
 primary key (a, b));
 cqlsh> insert into items(a,b,c1,c2,c3) values('AAA', 'BBB', 'C111',
 'C222', 'C333') using ttl 60;
 bash$ nodetool flush
 bash$ sleep 60
 bash$ nodetool compact test_ttl items
 bash$ sstabledump mc-2-big-Data.db

 [
   {
 "partition" : {
   "key" : [ "AAA" ],
   "position" : 0
 },
 "rows" : [
   {
 "type" : "row",
 "position" : 58,
 "clustering" : [ "BBB" ],
 "liveness_info" : { "tstamp" : "2018-01-10T13:29:25.777Z",
 "ttl" : 60, "expires_at" : "2018-01-10T13:30:25Z", "expired" : true },
 "cells" : [
   { "name" : "c1", "deletion_info" : { "local_delete_time" :
 "2018-01-10T13:29:25Z" }
   },
   { "name" : "c2", "deletion_info" : { "local_delete_time" :
 "2018-01-10T13:29:25Z" }
   },
   { "name" : "c3", "deletion_info" : { "local_delete_time" :
 "2018-01-10T13:29:25Z" }
   }
 ]
   }
 ]
   }
 ]

 The question is why Cassandra creates a tombstone for every column
 instead of single tombstone per row?

 In the production environment I have a table with ~30 columns, and it gives
 me a warning for 30k tombstones and 300 live rows. That is 30 times more than
 it could be.
 Can this behavior be tuned in some way?

 Thanks.

 --
 Best regards,
 Python_Max.

>>>
>>>
>>
>>
>> --
>> Best regards,
>> Python_Max.
>>
>
>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>



-- 
Best regards,
Python_Max.