Determining active sstables and table-<uuid> dir

2018-04-27 Thread Carl Mueller
In cases where a table was dropped and re-added, there are now two table
directories with different uuids, both containing sstables.

If you don't already know which one is active, how do you determine which is
the active table directory? I have tried cf_id from
system.schema_columnfamilies, and that works some of the time, but I have
seen cases where the cf_id does not match the table-<uuid> directory suffix.

I have also seen situations where sstables whose filenames don't include the
table/columnfamily name are sitting in the table dir and are clearly the
active sstables (they compacted when I ran a nodetool compact).

Is there a way to get a running cassandra node's sstables for a given
keyspace/table, and to tell which table-<uuid> directory is active?

This is in a 2.2.x environment that has probably churned a bit from 2.1.x
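
The kind of check I have been attempting, for reference (keyspace/table names
and the data path below are placeholders):

KS=my_keyspace; TBL=my_table
# cf_id the node currently considers live for this table
cqlsh -e "SELECT cf_id FROM system.schema_columnfamilies \
  WHERE keyspace_name='$KS' AND columnfamily_name='$TBL';"
# the on-disk directory suffix should be that cf_id with the dashes stripped
ls -d /var/lib/cassandra/data/$KS/${TBL}-*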


Rapid scaleup of cassandra nodes with snapshots and initial_token in the yaml

2018-02-14 Thread Carl Mueller
https://stackoverflow.com/questions/48776589/cassandra-cant-one-use-snapshots-to-rapidly-scale-out-a-cluster/48778179#48778179

So the basic question is: if one records the tokens of an existing node, via:

nodetool ring | grep ip_address_of_node | awk '{print $NF ","}' | xargs


for the desired node IP

then takes snapshots

then transfers the snapshots to a new node (not yet attached to cluster)

sets up initial_tokens in the yaml

sets up schema to match

then has it join the cluster

Would that allow quick scaleup of nodes/replication of data? I don't care
if the vnode map changes after the initial join, or if data starts being
streamed off as the cluster rebalances.

Is there an issue if the vnode tokens for two nodes are identical? Do they
have to be distinct for each node?
Is it that it mucks with the RF, since there will effectively be a greater RF
than normal?
Is this just not practically that much faster than an sstable load?

Basically, I was wondering if we could just use this to double the number of
nodes with identical copies of the node data via snapshots, and then later
on Cassandra can pare down which nodes own which data.
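
Concretely, something like this untested sketch is what I have in mind (the
source IP is a placeholder; auto_bootstrap: false is per the earlier advice on
this list):

SRC_IP=10.0.0.1   # existing node whose data/tokens are being cloned
nodetool ring | grep "$SRC_IP" | awk '{print $NF ","}' | xargs
# paste the resulting comma-separated list into the new node's cassandra.yaml:
#   initial_token: <that list, minus the trailing comma>
#   auto_bootstrap: false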


Re: Rapid scaleup of cassandra nodes with snapshots and initial_token in the yaml

2018-02-15 Thread Carl Mueller
Or could we do a rapid clone to a new cluster, then add that as another
datacenter?

On Wed, Feb 14, 2018 at 11:40 AM, Carl Mueller <carl.muel...@smartthings.com
> wrote:

> https://stackoverflow.com/questions/48776589/cassandra-
> cant-one-use-snapshots-to-rapidly-scale-out-a-cluster/48778179#48778179
>
> So the basic question is, if one records tokens and snapshots from an
> existing node, via:
>
> nodetool ring | grep ip_address_of_node | awk '{print $NF ","}' | xargs
>
>
> for the desired node IP
>
> then takes snapshots
>
> then transfers the snapshots to a new node (not yet attached to cluster)
>
> sets up initial_tokens in the yaml
>
> sets up schema to match
>
> then has it join the cluster
>
> Would that allow quick scaleup of nodes/replication of data? I don't care
> if the vnode map changes after the initial join, or data starts being
> streamed off as it rebalances, as the cluster
>
> Is there an issue if the vnodes tokens for two nodes are identical? Do
> they have to be distinct for each node?
> Is it that it mucks with the RF since there will be a greater RF than
> normal?
> Is this just not that practically faster than an sstable load?
>
> Basically, I was wondering if we just use this to double the number of
> nodes with identical copies of the node data via snapshots, and then later
> on cassandra can pare down which nodes own which data.
>
>
>


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
Looking through the 2.1.X code I see this:

org.apache.cassandra.io.sstable.Component.java

In the enum for component types there is a CUSTOM enum value which seems to
indicate a catchall for providing metadata for sstables.

Has this been exploited... ever? I noticed in some of the patches for the
archival options on TWCS there are complaints about the difficulty of
identifying which sstables are archived and which aren't.

I would be interested in using it to mark sstables with metadata indicating
the date range an sstable is targeted at for compaction.

discoverComponentsFor in SSTable.java seems to explicitly exclude loading any
files/sstable components that are CUSTOM.
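
For reference, the component suffixes that actually land on disk for a table
can be listed roughly like this (keyspace/table and data path are
placeholders); a CUSTOM component would just be one more suffix alongside
these:

ls /var/lib/cassandra/data/my_keyspace/my_table-*/ | awk -F- '{print $NF}' | sort -u
# typically Data.db, Index.db, Filter.db, CompressionInfo.db, Statistics.db,
# Summary.db, Digest.sha1 and TOC.txt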

On Wed, Feb 21, 2018 at 10:05 AM, Carl Mueller <carl.muel...@smartthings.com
> wrote:

> jon: I am planning on writing a custom compaction strategy. That's why the
> question is here, I figured the specifics of memtable -> sstable and
> cassandra internals are not a user question. If that still isn't deep
> enough for the dev thread, I will move all those questions to user.
>
> On Wed, Feb 21, 2018 at 9:59 AM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> Thank you all!
>>
>> On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves <k...@instaclustr.com>
>> wrote:
>>
>>> Probably a lot of work but it would be incredibly useful for vnodes if
>>> flushing was range aware (to be used with RangeAwareCompactionStrategy).
>>> The writers are already range aware for JBOD, but that's not terribly
>>> valuable ATM.
>>>
>>> On 20 February 2018 at 21:57, Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> There are some arguments to be made that the flush should consider
>>>> compaction strategy - would allow a bug flush to respect LCS filesizes or
>>>> break into smaller pieces to try to minimize range overlaps going from l0
>>>> into l1, for example.
>>>>
>>>> I have no idea how much work would be involved, but may be worthwhile.
>>>>
>>>>
>>>> --
>>>> Jeff Jirsa
>>>>
>>>>
>>>> On Feb 20,  2018, at 1:26 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>>>>
>>>> The file format is independent from compaction.  A compaction strategy
>>>> only selects sstables to be compacted, that’s it’s only job.  It could have
>>>> side effects, like generating other files, but any decent compaction
>>>> strategy will account for the fact that those other files don’t exist.
>>>>
>>>> I wrote a blog post a few months ago going over some of the nuance of
>>>> compaction you mind find informative: http://thelastpic
>>>> kle.com/blog/2017/03/16/compaction-nuance.html
>>>>
>>>> This is also the wrong mailing list, please direct future user
>>>> questions to the user list.  The dev list is for development of Cassandra
>>>> itself.
>>>>
>>>> Jon
>>>>
>>>> On Feb 20, 2018, at 1:10 PM, Carl Mueller <carl.muel...@smartthings.com>
>>>> wrote:
>>>>
>>>> When memtables/CommitLogs are flushed to disk/sstable, does the sstable
>>>> go
>>>> through sstable organization specific to each compaction strategy, or is
>>>> the sstable creation the same for all compactionstrats and it is up to
>>>> the
>>>> compaction strategy to recompact the sstable if desired?
>>>>
>>>>
>>>>
>>>
>>
>


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
Also, I was wondering if the key cache maintains a count of how many local
accesses a key undergoes. Such information might be very useful for
compaction: sstable data could be split by frequency of use so that
frequently accessed data can be preferentially compacted.

On Wed, Feb 21, 2018 at 5:08 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> Looking through the 2.1.X code I see this:
>
> org.apache.cassandra.io.sstable.Component.java
>
> In the enum for component types there is a CUSTOM enum value which seems
> to indicate a catchall for providing metadata for sstables.
>
> Has this been exploited... ever? I noticed in some of the patches for the
> archival options on TWCS there are complaints about being able to identify
> sstables that are archived and those that aren't.
>
> I would be interested in order to mark the sstables with metadata
> indicating the date range an sstable is targetted at for compactions.
>
> discoverComponentsFor seems to explicitly exclude the loadup of any
> files/sstable components that are CUSTOM in SStable.java
>
> On Wed, Feb 21, 2018 at 10:05 AM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> jon: I am planning on writing a custom compaction strategy. That's why
>> the question is here, I figured the specifics of memtable -> sstable and
>> cassandra internals are not a user question. If that still isn't deep
>> enough for the dev thread, I will move all those questions to user.
>>
>> On Wed, Feb 21, 2018 at 9:59 AM, Carl Mueller <
>> carl.muel...@smartthings.com> wrote:
>>
>>> Thank you all!
>>>
>>> On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves <k...@instaclustr.com>
>>> wrote:
>>>
>>>> Probably a lot of work but it would be incredibly useful for vnodes if
>>>> flushing was range aware (to be used with RangeAwareCompactionStrategy).
>>>> The writers are already range aware for JBOD, but that's not terribly
>>>> valuable ATM.
>>>>
>>>> On 20 February 2018 at 21:57, Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> There are some arguments to be made that the flush should consider
>>>>> compaction strategy - would allow a bug flush to respect LCS filesizes or
>>>>> break into smaller pieces to try to minimize range overlaps going from l0
>>>>> into l1, for example.
>>>>>
>>>>> I have no idea how much work would be involved, but may be worthwhile.
>>>>>
>>>>>
>>>>> --
>>>>> Jeff Jirsa
>>>>>
>>>>>
>>>>> On Feb 20,  2018, at 1:26 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>>>>>
>>>>> The file format is independent from compaction.  A compaction strategy
>>>>> only selects sstables to be compacted, that’s it’s only job.  It could 
>>>>> have
>>>>> side effects, like generating other files, but any decent compaction
>>>>> strategy will account for the fact that those other files don’t exist.
>>>>>
>>>>> I wrote a blog post a few months ago going over some of the nuance of
>>>>> compaction you mind find informative: http://thelastpic
>>>>> kle.com/blog/2017/03/16/compaction-nuance.html
>>>>>
>>>>> This is also the wrong mailing list, please direct future user
>>>>> questions to the user list.  The dev list is for development of Cassandra
>>>>> itself.
>>>>>
>>>>> Jon
>>>>>
>>>>> On Feb 20, 2018, at 1:10 PM, Carl Mueller <
>>>>> carl.muel...@smartthings.com> wrote:
>>>>>
>>>>> When memtables/CommitLogs are flushed to disk/sstable, does the
>>>>> sstable go
>>>>> through sstable organization specific to each compaction strategy, or
>>>>> is
>>>>> the sstable creation the same for all compactionstrats and it is up to
>>>>> the
>>>>> compaction strategy to recompact the sstable if desired?
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


Re: Rapid scaleup of cassandra nodes with snapshots and initial_token in the yaml

2018-02-20 Thread Carl Mueller
Ok, so vnode tokens are randomly assigned under normal circumstances (I'm on
2.1.x; I'm assuming a derivative approach was in the works that would avoid
some hot-node aspects of random primary range assignment for new nodes once
you had one or two or three in a cluster).

So... couldn't I just "engineer" the token assignments to be new primary
ranges that derive from the replica I am pulling the sstables from... say,
take the primary range of the previous node's tokens and just take the
midpoint of its range? If we stand up enough nodes, the implicit hot-ranging
going on here is mitigated.
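
By "midpoint" I mean something like this (sketch only; the token values below
are made up, and bc keeps the 64-bit Murmur3 math safe):

START=-9223372036854775808   # old node's range start (made-up token)
END=-4611686018427387904     # old node's token, i.e. the range end (made-up)
echo "(($START) + ($END)) / 2" | bc
# -> -6917529027641081856, a candidate token for the new node in this range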

The sstables can have data in them that is outside the primary range,
correct? nodetool cleanup can get rid of the data a node doesn't need.

That leaves the replicated ranges. Is there any means of replica range
distribution other than placing them on the nodes with the next range slices
in the overall primary range? If we are splitting the old node's primary
range, then the replicas would travel with it and the new node would
instantly become a replica of the old node. The next primary ranges also
hold the replicas.


On Fri, Feb 16, 2018 at 3:58 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> Thanks. Yeah, it appears this would only be doable if we didn't have
> vnodes and used old single token clusters. I guess Priam has something
> where you increase the cluster by whole number multiples. Then there's the
> issue of doing quorum read/writes if there suddenly is a new replica range
> with grey-area ownership/responsiblity for the range, like where
> LOCAL_QUORUM becomes a bit illdefined if more than one node is being added
> to a cluster.
>
> I guess the only way that would work is if the nodes were some multiple of
> the vnode count and vnodes distributed themselves consistently, so that
> expansions of RF multiples might be consistent and precomputable for
> responsible ranges.
>
> I will read that talk.
>
> On Thu, Feb 15, 2018 at 7:39 PM, kurt greaves <k...@instaclustr.com>
> wrote:
>
>> Ben did a talk
>> <https://www.youtube.com/watch?v=mMZBvPXAhzU=39=PLqcm6qE9lgKJkxYZUOIykswDndrOItnn2>
>> that might have some useful information. It's much more complicated with
>> vnodes though and I doubt you'll be able to get it to be as rapid as you'd
>> want.
>>
>> sets up schema to match
>>
>> This shouldn't be necessary. You'd just join the node as usual but with
>> auto_bootstrap: false and let the schema be propagated.
>>
>> Is there an issue if the vnodes tokens for two nodes are identical? Do
>>> they have to be distinct for each node?
>>
>> Yeah. This is annoying I know. The new node will take over the tokens of
>> the old node, which you don't want.
>>
>>
>>> Basically, I was wondering if we just use this to double the number of
>>> nodes with identical copies of the node data via snapshots, and then later
>>> on cassandra can pare down which nodes own which data.
>>
>> There wouldn't be much point to adding nodes with the same (or almost the
>> same) tokens. That would just be shifting load. You'd essentially need a
>> very smart allocation algorithm to come up with good token ranges, but then
>> you still have the problem of tracking down the relevant SSTables from the
>> nodes. Basically, bootstrap does this for you ATM and only streams the
>> relevant sections of SSTables for the new node. If you were doing it from
>> backups/snapshots you'd need to either do the same thing (eek) or copy all
>> the SSTables from all the relevant nodes.
>>
>> With single token nodes this becomes much easier. You can likely get away
>> with only copying around double/triple the data (depending on how you add
>> tokens to the ring and RF and node count).
>>
>> I'll just put it out there that C* is a database and really isn't
>> designed to be rapidly scalable. If you're going to try, be prepared to
>> invest A LOT of time into it.
>> ​
>>
>
>


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-20 Thread Carl Mueller
I think what is really necessary is providing table-level recipes for
storing data. We need a lot of real world examples and the resulting
schema, compaction strategies, and tunings that were performed for them.
Right now I don't see such crucial cookbook data in the project.

AI is a bit ridiculous: we'd need to apply AI to a collection of big data
systems, and then Cassandra would need an entirely separate AI system
ingesting ALL THE DATA that comes into the already Big Data system in order
to... what? What would we have the AI do? Restructure schemas? Switch
compaction strategies? Add/subtract nodes? Increase/decrease RF? Those are
all insane things to allocate to AI approaches, which are not transparent
about the factors and processing that led to their conclusions. The best we
could hope for are recommendations.

On Tue, Feb 20, 2018 at 5:39 AM, Kyrylo Lebediev 
wrote:

> Agree with you, Daniel, regarding gaps in documentation.
>
>
> ---
>
> At the same time I disagree with the folks who are complaining in this
> thread about some functionality like 'advanced backup' etc is missing out
> of the box.
>
> We all live in the time where there are literally tons of open-source
> tools (automation, monitoring) and languages are available, also there are
> some really powerful SaaS solutions on the market which support C*
> (Datadog, for instance).
>
>
> For example, while C* provides basic building blocks for anti-entropy
> repairs [I mean basic usage of 'nodetool repair' is not suitable for
> large production clusters], Reaper (many thanks to Spotify and
> TheLastPickle!) which uses this basic functionality solves the  task very
> well for real-world C* setups.
>
>
> Something is missing  / could be improved in your opinion - we're in era
> of open-source. Create your own tool, let's say for C* backups automation
> using EBS snapshots, and upload it on GitHub.
>
>
> C* is a DB-engine, not a fully-automated self-contained suite.
> End-users are able to work on automation of routine [3rd party projects],
> meanwhile C* contributors may focus on core functionality.
>
> --
>
> Going back to documentation topic, as far as I understand, DataStax is no
> longer main C* contributor  and is focused on own C*-based proprietary
> software [correct me smb if I'm wrong].
>
> This has led us to the situation when development of C* is progressing (as
> far as I understand, work is done mainly by some large C* users having
> enough resources to contribute to the C* project to get the features they
> need), but there is no single company which has taken over actualization of
> C* documentation / Wiki.
>
> Honestly, even DataStax's documentation is  too concise and  is missing a
> lot of important details.
>
> [BTW, just've taken a look at https://cassandra.apache.org/doc/latest/
> and it looks not that 'bad':  despite of TODOs it contains a lot of
> valuable information]
>
>
> So, I feel the C* Community has to join efforts on enriching existing
> documentation / resurrection of Wiki [where can be placed howto's,
> information about 3rd party automations and integrations etc].
>
> By the Community I mean all of us including myself.
>
>
>
> Regards,
>
> Kyrill
> --
> *From:* Daniel Hölbling-Inzko 
> *Sent:* Tuesday, February 20, 2018 11:28:13 AM
> *To:* user@cassandra.apache.org; James Briggs
>
> *Cc:* d...@cassandra.apache.org
> *Subject:* Re: Cassandra Needs to Grow Up by Version Five!
>
> Hi,
>
> I have to add my own two cents here as the main thing that keeps me from
> really running Cassandra is the amount of pain running it incurs.
> Not so much because it's actually painful but because the tools are so
> different and the documentation and best practices are scattered across a
> dozen outdated DataStax articles and this mailing list etc.. We've been
> hesitant (although our use case is perfect for using Cassandra) to deploy
> Cassandra to any critical systems as even after a year of running it we
> still don't have the operational experience to confidently run critical
> systems with it.
>
> Simple things like a foolproof / safe cluster-wide S3 Backup (like
> Elasticsearch has it) would for example solve a TON of issues for new
> people. I don't need it auto-scheduled or something, but having to
> configure cron jobs across the whole cluster is a pain in the ass for small
> teams.
> To be honest, even the way snapshots are done right now is already super
> painful. Every other system I operated so far will just create one backup
> folder I can export, in C* the Backup is scattered across a bunch of
> different Keyspace folders etc.. needless to say that it took a while until
> I trusted my backup scripts fully.
>
> And especially for a Database I believe Backup/Restore needs to be a
> non-issue that's documented front and center. If not smaller teams just
> don't have the resources to dedicate to learning and building the tools
> 

Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Carl Mueller
So in theory, one could double a cluster by:

1) moving snapshots of each node to a new node
2) for each snapshot moved, figuring out the primary range of the new node by
taking the old node's primary range token and calculating the midpoint
between that and the next primary range's start token
3) the RFs should be preserved, since the snapshots have a replicated set of
data for the old primary range, the next primary range already has a replica,
and so does the n+1 primary range

Data distribution will be the same as the old primary range distribution.

Then nodetool cleanup and repair would get rid of the old data ranges that
are not needed anymore.
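
Per node, after the copied-in data is live, I am picturing something like
(keyspace name is a placeholder):

nodetool cleanup my_keyspace      # drop data for ranges this node no longer owns
nodetool repair -pr my_keyspace   # re-sync the primary ranges it does own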

In practice, is this possible? I have heard Priam can double clusters and
they do not use vnodes. I am assuming they take a similar approach, but they
only have to calculate single tokens?

On Tue, Feb 20, 2018 at 11:21 AM, Carl Mueller <carl.muel...@smartthings.com
> wrote:

> As I understand it: Replicas of data are replicated to the next primary
> range owner.
>
> As tokens are randomly generated (at least in 2.1.x that I am on), can't
> we have this situation:
>
> Say we have RF3, but the tokens happen to line up where:
>
> NodeA handles 0-10
> NodeB handles  11-20
> NodeA handlea 21-30
> NodeB handles 31-40
> NodeC handles 40-50
>
> The key aspect of that is that the random assignment of primary range
> vnode tokens has resulted in NodeA and NodeB being the primaries for four
> adjacent primary ranges.
>
> IF RF is replicated by going to the next adjacent nodes in the primary
> range, and we are, say RF3, then B will have a replica of A, and then the
> THIRD REPLICA IS BACK ON A.
>
> Is the RF distribution durable to this by ignoring the reappearance of A
> and then cycling through until a unique node (NodeC) is encountered, and
> then that becomes the third replica?
>
>
>
>


Re: Cassandra counter readtimeout error

2018-02-20 Thread Carl Mueller
How "hot" are your partition keys in these counters?

I would think, theoretically, that if specific partition keys are getting
thousands of counter increment mutations, then compaction won't have
"compacted" those together into the final value yet, and you'll start
experiencing the problems people get with rows with thousands of tombstones.

So if you had an event 'birthdaypartyattendance'

and you had 1110 separate updates doing +1s/+2s/+3s to the attendance count
for that event (what a bday party!), then when you go to select the final
attendance value, many of those increments may still be on other nodes and
not fully replicated, so the read will have to pull 1110 cells and accumulate
them into the final value. When replication has completed and compaction
runs, it should amalgamate those. Writing at QUORUM will help ensure the
counter mutations are written to the proper number of nodes, with the usual
overhead of waiting on multiple replicas.
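
To make the shape of that concrete, a hypothetical counter table and
increment for the example above would be something like (the demo keyspace is
assumed to already exist):

cqlsh -e "
CREATE TABLE IF NOT EXISTS demo.attendance (
    event_id  text PRIMARY KEY,
    attendees counter
);
UPDATE demo.attendance SET attendees = attendees + 1
    WHERE event_id = 'birthdayparty';"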

DISCLAIMER: I don't have working knowledge of the code in distributed
counters. I just know they are a really hard problem and don't work great
in 2.x. As said, 3.x seems to be a lot better.

On Mon, Feb 19, 2018 at 10:43 AM, Alain RODRIGUEZ 
wrote:

> Hi Javier,
>
> Glad to hear it is solved now. Cassandra 3.11.1 should be a more stable
> version and 3.11 a better series.
>
> Excuse my misunderstanding, your table seems to be better designed than
> thought.
>
> Welcome to the Apache Cassandra community!
>
> C*heers ;-)
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> 2018-02-19 9:31 GMT+00:00 Javier Pareja :
>
>> Hi,
>>
>> Thank you for your reply.
>>
>> As I was bothered by this problem, last night I upgraded the cluster to
>> version 3.11.1 and everything is working now. As far as I can tell the
>> counter table can be read now. I will be doing more testing today with this
>> version but it is looking good.
>>
>> To answer your questions:
>> - I might not have explained the table definition very well but the table
>> does not have 6 partitions, but 6 partition keys. There are thousands of
>> partitions in that table, a combination of all those partition keys. I also
>> made sure that the partitions remained small when designing the table.
>> - I also enabled tracing in the CQLSH but it showed nothing when querying
>> this row. It however did when querying other tables...
>>
>> Thanks again for your reply!! I am very excited to be part of the
>> Cassandra user base.
>>
>> Javier
>>
>>
>>
>> F Javier Pareja
>>
>> On Mon, Feb 19, 2018 at 8:08 AM, Alain RODRIGUEZ 
>> wrote:
>>
>>>
>>> Hello,
>>>
>>> This table has 6 partition keys, 4 primary keys and 5 counters.
>>>
>>>
>>> I think the root issue is this ^. There might be some inefficiency or
>>> issues with counter, but this design, makes Cassandra relatively
>>> inefficient in most cases and using standard columns or counters
>>> indifferently.
>>>
>>> Cassandra data is supposed to be well distributed for a maximal
>>> efficiency. With only 6 partitions, if you have 6+ nodes, there is 100%
>>> chances that the load is fairly imbalanced. If you have less nodes, it's
>>> still probably poorly balanced. Also reading from a small number of
>>> sstables and in parallel within many nodes ideally to split the work and
>>> make queries efficient, but in this case cassandra is reading huge
>>> partitions from one node most probably. When the size of the request is too
>>> big it can timeout. I am not sure how pagination works with counters, but I
>>> believe even if pagination is working, at some point, you are just reading
>>> too much (or too inefficiently) and the timeout is reached.
>>>
>>> I imagined it worked well for a while as counters are very small columns
>>> / tables compared to any event data but at some point you might have
>>> reached 'physical' limit, because you are pulling *all* the information
>>> you need from one partition (and probably many SSTables)
>>>
>>> Is there really no other way to design this use case?
>>>
>>> When data starts to be inserted, I can query the counters correctly from
 that particular row but after a few minutes updating the table with
 thousands of events, I get a read timeout every time

>>>
>>> Troubleshot:
>>> - Use tracing to understand what takes so long with your queries
>>> - Check for warns / error in the logs. Cassandra use to complain when it
>>> is unhappy with the configurations. There a lot of interesting and it's
>>> been a while I last had a failure with no relevant informations in the logs.
>>> - Check SSTable per read and other read performances for this counter
>>> table. Using some monitoring could make the reason of this timeout obvious.
>>> If you use Datadog for example, I guess that a quick look at the "Read
>>> Path" Dashboard would help. If you are using 

vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Carl Mueller
As I understand it, replicas of data are placed on the next primary range
owner.

As tokens are randomly generated (at least in 2.1.x, which I am on), can't we
have this situation:

Say we have RF3, but the tokens happen to line up where:

NodeA handles 0-10
NodeB handles 11-20
NodeA handles 21-30
NodeB handles 31-40
NodeC handles 41-50

The key aspect of that is that the random assignment of primary range vnode
tokens has resulted in NodeA and NodeB being the primaries for four
adjacent primary ranges.

If replicas are placed on the next adjacent nodes in the primary range
ordering and we are at, say, RF3, then B will have a replica of A's range,
and then the THIRD REPLICA IS BACK ON A.

Is replica placement robust to this, i.e. does it ignore the reappearance of
A and keep cycling through until a unique node (NodeC) is encountered, which
then becomes the third replica?
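
(For what it's worth, replica placement for a specific key can be checked
directly; keyspace, table, and key below are placeholders:)

nodetool getendpoints my_keyspace my_table some_partition_key
# lists the nodes holding replicas for that key, showing how placement resolved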


Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread Carl Mueller
Ahhh, the topology strategy does that.

But if one were to maintain the same rack topology and was adding nodes
just within the racks... hmm, that might not be possible with new nodes.
Although AWS "racks" are at the availability-zone level IIRC, so that would
be doable.

Outside of rack awareness, would the next primary ranges take the replica
ranges?

On Tue, Feb 20, 2018 at 11:45 AM, Jon Haddad <j...@jonhaddad.com> wrote:

> That’s why you use a NTS + a snitch, it picks replaces based on rack
> awareness.
>
>
> On Feb 20, 2018, at 9:33 AM, Carl Mueller <carl.muel...@smartthings.com>
> wrote:
>
> So in theory, one could double a cluster by:
>
> 1) moving snapshots of each node to a new node.
> 2) for each snapshot moved, figure out the primary range of the new node
> by taking the old node's primary range token and calculating the midpoint
> value between that and the next primary range start token
> 3) the RFs should be preserved since the snapshot have a replicated set of
> data for the old primary range, the next primary has a RF already, and so
> does the n+1 primary range already
>
> data distribution will be the same as the old primary range distirubtion.
>
> Then nodetool clean and repair would get rid of old data ranges not needed
> anymore.
>
> In practice, is this possible? I have heard Priam can double clusters and
> they do not use vnodes. I am assuming they do a similar approach but they
> only have to calculate single tokens?
>
> On Tue, Feb 20, 2018 at 11:21 AM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> As I understand it: Replicas of data are replicated to the next primary
>> range owner.
>>
>> As tokens are randomly generated (at least in 2.1.x that I am on), can't
>> we have this situation:
>>
>> Say we have RF3, but the tokens happen to line up where:
>>
>> NodeA handles 0-10
>> NodeB handles  11-20
>> NodeA handlea 21-30
>> NodeB handles 31-40
>> NodeC handles 40-50
>>
>> The key aspect of that is that the random assignment of primary range
>> vnode tokens has resulted in NodeA and NodeB being the primaries for four
>> adjacent primary ranges.
>>
>> IF RF is replicated by going to the next adjacent nodes in the primary
>> range, and we are, say RF3, then B will have a replica of A, and then the
>> THIRD REPLICA IS BACK ON A.
>>
>> Is the RF distribution durable to this by ignoring the reappearance of A
>> and then cycling through until a unique node (NodeC) is encountered, and
>> then that becomes the third replica?
>>
>>
>>
>>
>
>


Re: Cluster Repairs 'nodetool repair -pr' Cause Severe IncreaseinRead Latency After Shrinking Cluster

2018-02-22 Thread Carl Mueller
; *Subject: *RE: Cluster Repairs 'nodetool repair -pr' Cause Severe
> IncreaseinRead Latency After Shrinking Cluster
>
>
>
>
>
> “ data was allowed to fully rebalance/repair/drain before the next node
> was taken off?”
>
> --
>
> Judging by the messages, the decomm was healthy. As an example
>
>
>
>   StorageService.java:3425 - Announcing that I have left the ring for
> 3ms
>
> …
>
> INFO  [RMI TCP Connection(4)-127.0.0.1] 2016-01-07 06:00:52,662
> StorageService.java:1191 – DECOMMISSIONED
>
>
>
> I do not believe repairs were run after each node removal. I’ll
> double-check.
>
>
>
> I’m not sure what you mean by ‘rebalance’? How do you check if a node is
> balanced? Load/size of data dir?
>
>
>
> As for the drain, there was no need to drain and I believe it is not
> something you do as part of decomm’ing a node.
>
>
> did you take 1 off per rack/AZ?
>
> --
>
> We removed 3 nodes, one from each AZ in sequence
>
>
>
> These are some of the cfhistogram metrics. Read latencies are high after
> the removal of the nodes
>
> --
>
> You can see reads of 186ms are at the 99th% from 5 sstables. There are
> awfully high numbers given that these metrics measure C* storage layer read
> performance.
>
>
>
> Does this mean removing the nodes undersized the cluster?
>
>
>
> key_space_01/cf_01 histograms
>
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             1.00          24.60       4055.27           11864           4
> 75%             2.00          35.43      14530.76           17084           4
> 95%             4.00         126.93      89970.66           35425           4
> 98%             5.00         219.34     155469.30           73457           4
> 99%             5.00         219.34     186563.16          105778           4
> Min             0.00           5.72         17.09              87           3
> Max             7.00       20924.30    1386179.89        14530764           4
>
> key_space_01/cf_01 histograms
>
> Percentile  SSTables  Write Latency  Read Latency  Partition Size  Cell Count
>                            (micros)      (micros)         (bytes)
> 50%             1.00          29.52       4055.27           11864           4
> 75%             2.00          42.51      10090.81           17084           4
> 95%             4.00         152.32      52066.35           35425           4
> 98%             4.00         219.34      89970.66           73457           4
> 99%             5.00         219.34     155469.30           88148           4
> Min             0.00           9.89         24.60              87           0
> Max             6.00        1955.67     557074.61        14530764           4
>
>
>
> 
> Thank you
>
>
>
> *From: *Carl Mueller <carl.muel...@smartthings.com>
> *Sent: *Wednesday, February 21, 2018 4:33 PM
> *To: *user@cassandra.apache.org
> *Subject: *Re: Cluster Repairs 'nodetool repair -pr' Cause Severe
> Increase inRead Latency After Shrinking Cluster
>
>
>
> Hm nodetool decommision performs the streamout of the replicated data, and
> you said that was apparently without error...
>
> But if you dropped three nodes in one AZ/rack on a five node with RF3,
> then we have a missing RF factor unless NetworkTopologyStrategy fails over
> to another AZ. But that would also entail cross-az streaming and queries
> and repair.
>
>
>
> On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
> sorry for the idiot questions...
>
> data was allowed to fully rebalance/repair/drain before the next node was
> taken off?
>
> did you take 1 off per rack/AZ?
>
>
>
>
>
> On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
>
> One node at a time
>
>
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com>
> wrote:
>
> What is your replication factor?
> Single datacenter, three availability zones, is that right?
> You removed one node at a time or three at once?
>
>
>
> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>
> We have had a 15 node cluster across three zones and cluster repairs using
> ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the
> cluster to 12. Since then, same repair job has taken up to 12 hours to
> finish and most times, it never does.
>
>
>
> More importantly, at some point during the repair cycle, we see read
> latencies jumping to 1-2 seconds and applications immediately notice the
> impact.
>
>
>
> stream_throughput_outbound_megabits_per_sec is set at 200 and
> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
> around ~500GB at 44% usage.
>
>
>
> When shrinking the cluster, the ‘nodetool decommision’ was eventless. It
> completed successfully with no issues.
>
>
>
> What could possibly cause repairs to cause this impact following cluster
> downsizing? Taking three nodes out does not seem compatible with such a
> drastic effect on repair and read latency.
>
>
>
> Any expert insights will be appreciated.
>
> 
> Thank you
>
>
>
>
>
>
>
>
>
>
>
>
>


Re: Rapid scaleup of cassandra nodes with snapshots and initial_token in the yaml

2018-02-16 Thread Carl Mueller
Thanks. Yeah, it appears this would only be doable if we didn't have vnodes
and used old single-token clusters. I guess Priam has something where you
increase the cluster by whole-number multiples. Then there's the issue of
doing quorum reads/writes if there suddenly is a new replica range with
grey-area ownership/responsibility for the range, where LOCAL_QUORUM becomes
a bit ill-defined if more than one node is being added to the cluster.

I guess the only way that would work is if the node count were some multiple
of the vnode count and the vnodes distributed themselves consistently, so
that expansions by RF multiples might be consistent and precomputable for the
responsible ranges.

I will read that talk.

On Thu, Feb 15, 2018 at 7:39 PM, kurt greaves  wrote:

> Ben did a talk
> 
> that might have some useful information. It's much more complicated with
> vnodes though and I doubt you'll be able to get it to be as rapid as you'd
> want.
>
> sets up schema to match
>
> This shouldn't be necessary. You'd just join the node as usual but with
> auto_bootstrap: false and let the schema be propagated.
>
> Is there an issue if the vnodes tokens for two nodes are identical? Do
>> they have to be distinct for each node?
>
> Yeah. This is annoying I know. The new node will take over the tokens of
> the old node, which you don't want.
>
>
>> Basically, I was wondering if we just use this to double the number of
>> nodes with identical copies of the node data via snapshots, and then later
>> on cassandra can pare down which nodes own which data.
>
> There wouldn't be much point to adding nodes with the same (or almost the
> same) tokens. That would just be shifting load. You'd essentially need a
> very smart allocation algorithm to come up with good token ranges, but then
> you still have the problem of tracking down the relevant SSTables from the
> nodes. Basically, bootstrap does this for you ATM and only streams the
> relevant sections of SSTables for the new node. If you were doing it from
> backups/snapshots you'd need to either do the same thing (eek) or copy all
> the SSTables from all the relevant nodes.
>
> With single token nodes this becomes much easier. You can likely get away
> with only copying around double/triple the data (depending on how you add
> tokens to the ring and RF and node count).
>
> I'll just put it out there that C* is a database and really isn't designed
> to be rapidly scalable. If you're going to try, be prepared to invest A LOT
> of time into it.
> ​
>


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
Thank you all!

On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves <k...@instaclustr.com> wrote:

> Probably a lot of work but it would be incredibly useful for vnodes if
> flushing was range aware (to be used with RangeAwareCompactionStrategy).
> The writers are already range aware for JBOD, but that's not terribly
> valuable ATM.
>
> On 20 February 2018 at 21:57, Jeff Jirsa <jji...@gmail.com> wrote:
>
>> There are some arguments to be made that the flush should consider
>> compaction strategy - would allow a bug flush to respect LCS filesizes or
>> break into smaller pieces to try to minimize range overlaps going from l0
>> into l1, for example.
>>
>> I have no idea how much work would be involved, but may be worthwhile.
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Feb 20,  2018, at 1:26 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>>
>> The file format is independent from compaction.  A compaction strategy
>> only selects sstables to be compacted, that’s it’s only job.  It could have
>> side effects, like generating other files, but any decent compaction
>> strategy will account for the fact that those other files don’t exist.
>>
>> I wrote a blog post a few months ago going over some of the nuance of
>> compaction you mind find informative: http://thelastpic
>> kle.com/blog/2017/03/16/compaction-nuance.html
>>
>> This is also the wrong mailing list, please direct future user questions
>> to the user list.  The dev list is for development of Cassandra itself.
>>
>> Jon
>>
>> On Feb 20, 2018, at 1:10 PM, Carl Mueller <carl.muel...@smartthings.com>
>> wrote:
>>
>> When memtables/CommitLogs are flushed to disk/sstable, does the sstable go
>> through sstable organization specific to each compaction strategy, or is
>> the sstable creation the same for all compactionstrats and it is up to the
>> compaction strategy to recompact the sstable if desired?
>>
>>
>>
>


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread Carl Mueller
jon: I am planning on writing a custom compaction strategy. That's why the
question is here, I figured the specifics of memtable -> sstable and
cassandra internals are not a user question. If that still isn't deep
enough for the dev thread, I will move all those questions to user.

On Wed, Feb 21, 2018 at 9:59 AM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> Thank you all!
>
> On Tue, Feb 20, 2018 at 7:35 PM, kurt greaves <k...@instaclustr.com>
> wrote:
>
>> Probably a lot of work but it would be incredibly useful for vnodes if
>> flushing was range aware (to be used with RangeAwareCompactionStrategy).
>> The writers are already range aware for JBOD, but that's not terribly
>> valuable ATM.
>>
>> On 20 February 2018 at 21:57, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> There are some arguments to be made that the flush should consider
>>> compaction strategy - would allow a bug flush to respect LCS filesizes or
>>> break into smaller pieces to try to minimize range overlaps going from l0
>>> into l1, for example.
>>>
>>> I have no idea how much work would be involved, but may be worthwhile.
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> On Feb 20,  2018, at 1:26 PM, Jon Haddad <j...@jonhaddad.com> wrote:
>>>
>>> The file format is independent from compaction.  A compaction strategy
>>> only selects sstables to be compacted, that’s it’s only job.  It could have
>>> side effects, like generating other files, but any decent compaction
>>> strategy will account for the fact that those other files don’t exist.
>>>
>>> I wrote a blog post a few months ago going over some of the nuance of
>>> compaction you mind find informative: http://thelastpic
>>> kle.com/blog/2017/03/16/compaction-nuance.html
>>>
>>> This is also the wrong mailing list, please direct future user questions
>>> to the user list.  The dev list is for development of Cassandra itself.
>>>
>>> Jon
>>>
>>> On Feb 20, 2018, at 1:10 PM, Carl Mueller <carl.muel...@smartthings.com>
>>> wrote:
>>>
>>> When memtables/CommitLogs are flushed to disk/sstable, does the sstable
>>> go
>>> through sstable organization specific to each compaction strategy, or is
>>> the sstable creation the same for all compactionstrats and it is up to
>>> the
>>> compaction strategy to recompact the sstable if desired?
>>>
>>>
>>>
>>
>


Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Carl Mueller
DCs can be stood up with snapshotted data.


Stand up a new cluster with your old cluster snapshots:

https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_snapshot_restore_new_cluster.html

Then link the DCs together.

Disclaimer: I've never done this in real life.
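
Roughly, the per-node mechanics would look like this (again untested; paths,
snapshot tag, and names below are placeholders; the doc above has the
authoritative steps):

KS=my_keyspace; TBL=my_table; SNAP=pre_migration
# copy the snapshot's sstables into the matching table directory on the new node
cp /backups/$KS/$TBL/snapshots/$SNAP/* /var/lib/cassandra/data/$KS/$TBL-*/
# have the node pick them up without a restart
nodetool refresh $KS $TBL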

On Wed, Feb 21, 2018 at 9:25 AM, Nitan Kainth  wrote:

> New dc will be faster but may impact cluster performance due to streaming.
>
> Sent from my iPhone
>
> On Feb 21, 2018, at 8:53 AM, Leena Ghatpande 
> wrote:
>
> We do use LOCAL_ONE and LOCAL_Quorum currently. But these 8 nodes need to
> be in 2 different DC< so we would end up create additional 2 new DC and
> dropping 2.
>
> are there any advantages on adding DC over one node at a time?
>
>
> --
> *From:* Jeff Jirsa 
> *Sent:* Wednesday, February 21, 2018 1:02 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Best approach to Replace existing 8 smaller nodes in
> production cluster with New 8 nodes that are bigger in capacity, without a
> downtime
>
> You add the nodes with rf=0 so there’s no streaming, then bump it to rf=1
> and run repair, then rf=2 and run repair, then rf=3 and run repair, then
> you either change the app to use local quorum in the new dc, or reverse the
> process by decreasing the rf in the original dc by 1 at a time
>
> --
> Jeff Jirsa
>
>
> > On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev 
> wrote:
> >
> > I'd say, "add new DC, then remove old DC" approach is more risky
> especially if they use QUORUM CL (in this case they will need to change CL
> to LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
> > Also, if there is a chance to get rid of streaming, it worth doing as
> usually direct data copy (not by means of C*) is more effective and less
> troublesome.
> >
> > Regards,
> > Kyrill
> >
> > 
> > From: Nitan Kainth 
> > Sent: Wednesday, February 21, 2018 1:04:05 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Best approach to Replace existing 8 smaller nodes in
> production cluster with New 8 nodes that are bigger in capacity, without a
> downtime
> >
> > You can also create a new DC and then terminate old one.
> >
> > Sent from my iPhone
> >
> >> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev 
> wrote:
> >>
> >> Hi,
> >> Consider using this approach, replacing nodes one by one:
> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-
> place-node-replacement/
>
> 
> Cassandra instantaneous in place node replacement | Carlos ...
> 
> mrcalonso.com
> At some point everyone using Cassandra faces the situation of having to
> replace nodes. Either because the cluster needs to scale and some nodes are
> too small or ...
>
> >>
> >> Regards,
> >> Kyrill
> >>
> >> 
> >> From: Leena Ghatpande 
> >> Sent: Tuesday, February 20, 2018 10:24:24 PM
> >> To: user@cassandra.apache.org
> >> Subject: Best approach to Replace existing 8 smaller nodes in
> production cluster with New 8 nodes that are bigger in capacity, without a
> downtime
> >>
> >> Best approach to replace existing 8 smaller 8 nodes in production
> cluster with New 8 nodes that are bigger in capacity without a downtime
> >>
> >> We have 4 nodes each in 2 DC, and we want to replace these 8 nodes with
> new 8 nodes that are bigger in capacity in terms of RAM,CPU and Diskspace
> without a downtime.
> >> The RF is set to 3 currently, and we have 2 large tables with upto
> 70Million rows
> >>
> >> What would be the best approach to implement this
> >>- Add 1 New Node and Decomission 1 Old node at a time?
> >>- Add all New nodes to the cluster, and then decommission old nodes ?
> >>If we do this, can we still keep the RF=3 while we have 16 nodes
> at a point in the cluster before we start decommissioning?
> >>   - How long do we wait in between adding a Node or decomissiing to
> ensure the process is complete before we proceed?
> >>   - Any tool that we can use to monitor if the add/decomission node is
> done before we proceed to next
> >>
> >> Any other suggestion?
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: user-h...@cassandra.apache.org
> >>
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: 

Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Carl Mueller
What is your replication factor?
Single datacenter, three availability zones, is that right?
You removed one node at a time or three at once?

On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash  wrote:

> We have had a 15 node cluster across three zones and cluster repairs using
> ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk the
> cluster to 12. Since then, same repair job has taken up to 12 hours to
> finish and most times, it never does.
>
>
>
> More importantly, at some point during the repair cycle, we see read
> latencies jumping to 1-2 seconds and applications immediately notice the
> impact.
>
>
>
> stream_throughput_outbound_megabits_per_sec is set at 200 and
> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
> around ~500GB at 44% usage.
>
>
>
> When shrinking the cluster, the ‘nodetool decommision’ was eventless. It
> completed successfully with no issues.
>
>
>
> What could possibly cause repairs to cause this impact following cluster
> downsizing? Taking three nodes out does not seem compatible with such a
> drastic effect on repair and read latency.
>
>
>
> Any expert insights will be appreciated.
>
> 
> Thank you
>
>
>


Re: Best approach to Replace existing 8 smaller nodes in production cluster with New 8 nodes that are bigger in capacity, without a downtime

2018-02-21 Thread Carl Mueller
I don't disagree with jon.

On Wed, Feb 21, 2018 at 10:27 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> The easiest way to do this is replacing one node at a time by using
> rsync.  I don't know why it has to be more complicated than copying data to
> a new machine and replacing it in the cluster.   Bringing up a new DC with
> snapshots is going to be a nightmare in comparison.
>
> On Wed, Feb 21, 2018 at 8:16 AM Carl Mueller <carl.muel...@smartthings.com>
> wrote:
>
>> DCs can be stood up with snapshotted data.
>>
>>
>> Stand up a new cluster with your old cluster snapshots:
>>
>> https://docs.datastax.com/en/cassandra/2.1/cassandra/
>> operations/ops_snapshot_restore_new_cluster.html
>>
>> Then link the DCs together.
>>
>> Disclaimer: I've never done this in real life.
>>
>> On Wed, Feb 21, 2018 at 9:25 AM, Nitan Kainth <nitankai...@gmail.com>
>> wrote:
>>
>>> New dc will be faster but may impact cluster performance due to
>>> streaming.
>>>
>>> Sent from my iPhone
>>>
>>> On Feb 21, 2018, at 8:53 AM, Leena Ghatpande <lghatpa...@hotmail.com>
>>> wrote:
>>>
>>> We do use LOCAL_ONE and LOCAL_Quorum currently. But these 8 nodes need
>>> to be in 2 different DC< so we would end up create additional 2 new DC and
>>> dropping 2.
>>>
>>> are there any advantages on adding DC over one node at a time?
>>>
>>>
>>> --
>>> *From:* Jeff Jirsa <jji...@gmail.com>
>>> *Sent:* Wednesday, February 21, 2018 1:02 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Best approach to Replace existing 8 smaller nodes in
>>> production cluster with New 8 nodes that are bigger in capacity, without a
>>> downtime
>>>
>>> You add the nodes with rf=0 so there’s no streaming, then bump it to
>>> rf=1 and run repair, then rf=2 and run repair, then rf=3 and run repair,
>>> then you either change the app to use local quorum in the new dc, or
>>> reverse the process by decreasing the rf in the original dc by 1 at a time
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> > On Feb 20, 2018, at 8:51 PM, Kyrylo Lebediev <kyrylo_lebed...@epam.com>
>>> wrote:
>>> >
>>> > I'd say, "add new DC, then remove old DC" approach is more risky
>>> especially if they use QUORUM CL (in this case they will need to change CL
>>> to LOCAL_QUORUM, otherwise they'll run into a lot of blocking read repairs).
>>> > Also, if there is a chance to get rid of streaming, it worth doing as
>>> usually direct data copy (not by means of C*) is more effective and less
>>> troublesome.
>>> >
>>> > Regards,
>>> > Kyrill
>>> >
>>> > 
>>> > From: Nitan Kainth <nitankai...@gmail.com>
>>> > Sent: Wednesday, February 21, 2018 1:04:05 AM
>>> > To: user@cassandra.apache.org
>>> > Subject: Re: Best approach to Replace existing 8 smaller nodes in
>>> production cluster with New 8 nodes that are bigger in capacity, without a
>>> downtime
>>> >
>>> > You can also create a new DC and then terminate old one.
>>> >
>>> > Sent from my iPhone
>>> >
>>> >> On Feb 20, 2018, at 2:49 PM, Kyrylo Lebediev <
>>> kyrylo_lebed...@epam.com> wrote:
>>> >>
>>> >> Hi,
>>> >> Consider using this approach, replacing nodes one by one:
>>> https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-
>>> place-node-replacement/
>>>
>>> <https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/>
>>> Cassandra instantaneous in place node replacement | Carlos ...
>>> <https://mrcalonso.com/2016/01/26/cassandra-instantaneous-in-place-node-replacement/>
>>> mrcalonso.com
>>> At some point everyone using Cassandra faces the situation of having to
>>> replace nodes. Either because the cluster needs to scale and some nodes are
>>> too small or ...
>>>
>>> >>
>>> >> Regards,
>>> >> Kyrill
>>> >>
>>> >> 
>>> >> From: Leena Ghatpande <lghatpa...@hotmail.com>
>>> >> Sent: Tuesday, February 20, 2018 10:24:24 PM
>>> >> To: user@cassandra.apache.org
&g

Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Carl Mueller
sorry for the idiot questions...

data was allowed to fully rebalance/repair/drain before the next node was
taken off?

did you take 1 off per rack/AZ?


On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:

> One node at a time
>
> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com>
> wrote:
>
>> What is your replication factor?
>> Single datacenter, three availability zones, is that right?
>> You removed one node at a time or three at once?
>>
>> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>>
>>> We have had a 15 node cluster across three zones and cluster repairs
>>> using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk
>>> the cluster to 12. Since then, same repair job has taken up to 12 hours to
>>> finish and most times, it never does.
>>>
>>>
>>>
>>> More importantly, at some point during the repair cycle, we see read
>>> latencies jumping to 1-2 seconds and applications immediately notice the
>>> impact.
>>>
>>>
>>>
>>> stream_throughput_outbound_megabits_per_sec is set at 200 and
>>> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
>>> around ~500GB at 44% usage.
>>>
>>>
>>>
>>> When shrinking the cluster, the ‘nodetool decommision’ was eventless. It
>>> completed successfully with no issues.
>>>
>>>
>>>
>>> What could possibly cause repairs to cause this impact following cluster
>>> downsizing? Taking three nodes out does not seem compatible with such a
>>> drastic effect on repair and read latency.
>>>
>>>
>>>
>>> Any expert insights will be appreciated.
>>>
>>> 
>>> Thank you
>>>
>>>
>>>
>>
>>


Re: Cluster Repairs 'nodetool repair -pr' Cause Severe Increase in Read Latency After Shrinking Cluster

2018-02-21 Thread Carl Mueller
Hm, nodetool decommission performs the stream-out of the replicated data, and
you said that was apparently without error...

But if you dropped three nodes from one AZ/rack of five with RF3, then we
have a missing replica unless NetworkTopologyStrategy fails over to another
AZ. But that would also entail cross-AZ streaming, queries, and repair.
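
(As a quick sanity check on ownership and rack layout after the
decommissions, with the keyspace name as a placeholder:)

nodetool status my_keyspace   # per-node load, effective ownership %, and rack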

On Wed, Feb 21, 2018 at 3:30 PM, Carl Mueller <carl.muel...@smartthings.com>
wrote:

> sorry for the idiot questions...
>
> data was allowed to fully rebalance/repair/drain before the next node was
> taken off?
>
> did you take 1 off per rack/AZ?
>
>
> On Wed, Feb 21, 2018 at 12:29 PM, Fred Habash <fmhab...@gmail.com> wrote:
>
>> One node at a time
>>
>> On Feb 21, 2018 10:23 AM, "Carl Mueller" <carl.muel...@smartthings.com>
>> wrote:
>>
>>> What is your replication factor?
>>> Single datacenter, three availability zones, is that right?
>>> You removed one node at a time or three at once?
>>>
>>> On Wed, Feb 21, 2018 at 10:20 AM, Fd Habash <fmhab...@gmail.com> wrote:
>>>
>>>> We have had a 15 node cluster across three zones and cluster repairs
>>>> using ‘nodetool repair -pr’ took about 3 hours to finish. Lately, we shrunk
>>>> the cluster to 12. Since then, same repair job has taken up to 12 hours to
>>>> finish and most times, it never does.
>>>>
>>>>
>>>>
>>>> More importantly, at some point during the repair cycle, we see read
>>>> latencies jumping to 1-2 seconds and applications immediately notice the
>>>> impact.
>>>>
>>>>
>>>>
>>>> stream_throughput_outbound_megabits_per_sec is set at 200 and
>>>> compaction_throughput_mb_per_sec at 64. The /data dir on the nodes is
>>>> around ~500GB at 44% usage.
>>>>
>>>>
>>>>
>>>> When shrinking the cluster, the ‘nodetool decommision’ was eventless.
>>>> It completed successfully with no issues.
>>>>
>>>>
>>>>
>>>> What could possibly cause repairs to cause this impact following
>>>> cluster downsizing? Taking three nodes out does not seem compatible with
>>>> such a drastic effect on repair and read latency.
>>>>
>>>>
>>>>
>>>> Any expert insights will be appreciated.
>>>>
>>>> 
>>>> Thank you
>>>>
>>>>
>>>>
>>>
>>>
>


Re: Performance Of IN Queries On Wide Rows

2018-02-21 Thread Carl Mueller
Cass 2.1.14 is missing some wide-row optimizations done in later Cassandra
releases, IIRC.

Speculation: IN won't matter; it will load the entire wide row into memory
regardless, which might spike your GC/heap and overflow the row cache.

On Wed, Feb 21, 2018 at 2:16 PM, Gareth Collins 
wrote:

> Thanks for the response!
>
> I could understand that being the case if the Cassandra cluster is not
> loaded. Splitting the work across multiple nodes would obviously make
> the query faster.
>
> But if this was just a single node, shouldn't one IN query be faster
> than multiple due to the fact that, if I understand correctly,
> Cassandra should need to do less work?
>
> thanks in advance,
> Gareth
>
> On Wed, Feb 21, 2018 at 7:27 AM, Rahul Singh
>  wrote:
> > That depends on the driver you use but separate queries asynchronously
> > around the cluster would be faster.
> >
> >
> > --
> > Rahul Singh
> > rahul.si...@anant.us
> >
> > Anant Corporation
> >
> > On Feb 20, 2018, 6:48 PM -0500, Eric Stevens , wrote:
> >
> > Someone can correct me if I'm wrong, but I believe if you do a large
> IN() on
> > a single partition's cluster keys, all the reads are going to be served
> from
> > a single replica.  Compared to many concurrent individual equal
> statements
> > you can get the performance gain of leaning on several replicas for
> > parallelism.
> >
> > On Tue, Feb 20, 2018 at 11:43 AM Gareth Collins <
> gareth.o.coll...@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> When querying large wide rows for multiple specific values is it
> >> better to do separate queries for each value...or do it with one query
> >> and an "IN"? I am using Cassandra 2.1.14
> >>
> >> I am asking because I had changed my app to use 'IN' queries and it
> >> **appears** to be slower rather than faster. I had assumed that the
> >> "IN" query should be faster...as I assumed it only needs to go down
> >> the read path once (i.e. row cache -> memtable -> key cache -> bloom
> >> filter -> index summary -> index -> compaction -> sstable) rather than
> >> once for each entry? Or are there some additional caveats that I
> >> should be aware of for 'IN' query performance (e.g. ordering of 'IN'
> >> query entries, closeness of 'IN' query values in the SSTable etc.)?
> >>
> >> thanks in advance,
> >> Gareth Collins
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: user-h...@cassandra.apache.org
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


"minimum backup" in vnodes

2018-08-15 Thread Carl Mueller
Goal: back up a cluster with the minimum amount of data. Restore to be done
with sstableloader

Let's start with a basic case:
- six node cluster
- one datacenter
- RF3
- data is perfectly replicated/repaired
- Manual tokens (no vnodes)
- simplest strategy

In this case, it is (theoretically) possible to get a perfect backup of the
data by storing the snapshots of two of the six nodes in the cluster, due to
the replication factor.

I once tried to parse the ring output with vnodes (256) and came to the
conclusion that it was not possible with vnodes, maybe you could avoid one
or two nodes of the six... tops. But I may have had an incorrect
understanding of how ranges are replicated in vnodes.

Would it be possible to pick only two nodes out of a six-node cluster with
vnodes and RF3 that would back up the whole cluster?
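
The way I'd frame it is as a set-cover problem. A rough, untested sketch,
assuming you have already parsed a map of token range -> replica nodes out of
something like `nodetool describering <keyspace>` (parsing not shown):

    def minimal_backup_nodes(range_replicas):
        """range_replicas: dict of token range -> set of replica node IPs."""
        uncovered = set(range_replicas)
        chosen = []
        while uncovered:
            # greedily pick the node covering the most still-uncovered ranges
            counts = {}
            for rng in uncovered:
                for node in range_replicas[rng]:
                    counts[node] = counts.get(node, 0) + 1
            best = max(counts, key=counts.get)
            chosen.append(best)
            uncovered = {r for r in uncovered if best not in range_replicas[r]}
        return chosen

    # toy example: 6 nodes, manual tokens, RF3 -- two nodes suffice
    toy = {i: {f"n{i}", f"n{(i + 1) % 6}", f"n{(i + 2) % 6}"} for i in range(6)}
    print(minimal_backup_nodes(toy))   # prints two node names

With 256 vnodes per node, the greedy cover would at least tell you how few
nodes you can actually get away with for your particular token layout.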


Re: [EXTERNAL] Re: Nodetool refresh v/s sstableloader

2018-08-30 Thread Carl Mueller
- A range-aware compaction strategy that subdivides data by token range
could help for this: you only back up data for the node's primary ranges and
not the replica data
- yes, if you want to use nodetool refresh as some sort of recovery
solution, MAKE SURE YOU STORE THE TOKEN LIST with the
sstables/snapshots/backups for the nodes.
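
For the token list, one way that avoids parsing nodetool output is to pull it
straight out of system.local; a minimal sketch with the Python driver
(contact point and file name are placeholders):

    from cassandra.cluster import Cluster

    cluster = Cluster(["127.0.0.1"])      # the node being snapshotted
    session = cluster.connect()
    row = session.execute("SELECT tokens FROM system.local").one()
    with open("tokens.txt", "w") as f:
        f.write(",".join(sorted(row.tokens)))   # paste into initial_token later
    cluster.shutdown()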

On Wed, Aug 29, 2018 at 8:57 AM Durity, Sean R 
wrote:

> Sstableloader, though, could require a lot more disk space – until
> compaction can reduce. For example, if your RF=3, you will essentially be
> loading 3 copies of the data. Then it will get replicated 3 more times as
> it is being loaded. Thus, you could need up to 9x disk space.
>
>
>
>
>
> Sean Durity
>
> *From:* kurt greaves 
> *Sent:* Wednesday, August 29, 2018 7:26 AM
> *To:* User 
> *Subject:* [EXTERNAL] Re: Nodetool refresh v/s sstableloader
>
>
>
> Removing dev...
>
> Nodetool refresh only picks up new SSTables that have been placed in the
> tables directory. It doesn't account for actual ownership of the data like
> SSTableloader does. Refresh will only work properly if the SSTables you are
> copying in are completely covered by that nodes tokens. It doesn't work if
> there's a change in topology, replication and token ownership will have to
> be more or less the same.
>
>
>
> SSTableloader will break up the SSTables and send the relevant bits to
> whichever node needs it, so no need for you to worry about tokens and
> copying data to the right places, it will do that for you.
>
>
>
> On 28 August 2018 at 11:27, Rajath Subramanyam  wrote:
>
> Hi Cassandra users, Cassandra dev,
>
>
>
> When recovering using SSTables from a snapshot, I want to know what are
> the key differences between using:
>
> 1. Nodetool refresh and,
>
> 2. SSTableloader
>
>
>
> Does nodetool refresh have restrictions that need to be met?
> Does nodetool refresh work even if there is a change in the topology
> between the source cluster and the destination cluster? Does it work if the
> token ranges don't match between the source cluster and the destination
> cluster? Does it work when an old SSTable in the snapshot has a dropped
> column that is not part of the current schema?
>
>
>
> I appreciate any help in advance.
>
>
>
> Thanks,
>
> Rajath
>
> 
>
> Rajath Subramanyam
>
>
>
>
>


Re: Data Deleted After a few days of being off

2018-02-27 Thread Carl Mueller
Does cassandra still function if the commitlog dir has no writes? Will the
data still go into the memtable and serve queries?

On Tue, Feb 27, 2018 at 1:37 AM, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Feb 27, 2018 at 7:37 AM, A  wrote:
>
>>
>> I started going through the logs and haven't noticed anything yet... Very
>> unexpected behavior.
>>
>
> Maybe I'm asking the obvious, but were your inserts *without* a TTL?
>
> --
> Alex
>
>


Re: Filling in the blank To Do sections on the Apache Cassandra web site

2018-02-27 Thread Carl Mueller
If there was a github for the docs, we could start posting content to it
for review. Not sure what the review/contribution process is on Apache.
Google searches on apache documentation and similar run into lots of noise
from actual projects.

I wouldn't mind trying to do a little doc work on the regular if there was
a wiki, a proven means to do collaborative docs.


On Tue, Feb 27, 2018 at 11:42 AM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> It’s just content for web pages.  There isn’t a working outline or any
> draft on any of the JIRA’s yet.  I like to keep things simple.  Did I miss
> something?  What does it matter right now?
>
>
>
> Thanks Carl,
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Carl Mueller [mailto:carl.muel...@smartthings.com]
> *Sent:* Tuesday, February 27, 2018 8:50 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
> so... are those pages in the code tree of github? I don't see them or a
> directory structure under /doc. Is mirroring the documentation between the
> apache site and a github source a big issue?
>
>
>
> On Tue, Feb 27, 2018 at 7:50 AM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> I was debating that.  Splitting it up into smaller tasks makes each one
> seem less over-whelming.
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Josh McKenzie [mailto:jmcken...@apache.org]
> *Sent:* Tuesday, February 27, 2018 5:44 AM
> *To:* cassandra
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
> Might help, organizationally, to put all these efforts under a single
> ticket of "Improve web site Documentation" and add these as sub-tasks
> Should be able to do that translation post-creation (i.e. in its current
> state) if that's something that makes sense to you.
>
>
>
> On Mon, Feb 26, 2018 at 5:24 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> Here are the related JIRA’s.  Please add content even if It’s not well
> formed compositionally.  Myself or someone else will take it from there
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-14274  The
> troubleshooting section of the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14273  The Bulk Loading
> web page on the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14272  The Backups web
> page on the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14271  The Hints web page
> in the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14270  The Read repair
> web page is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14269  The Data
> Modeling section of the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14268  The
> Architecture:Guarantees web page is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14267  The Dynamo web
> page on the Apache Cassandra site is missing content
>
> https://issues.apache.org/jira/browse/CASSANDRA-14266  The Architecture
> Overview web page on the Apache Cassandra site is empty
>
>
>
> Thanks for pitching in
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> *Sent:* Monday, February 26, 2018 1:54 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
> Nice!  Thanks for the help Oliver!
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Oliver Ruebenacker [mailto:cur...@gmail.com]
> *Sent:* Sunday, February 25, 2018 7:12 AM
> *To:* user@cassandra.apache.org
> *Cc:* d...@cassandra.apache.org
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
>
>
>  Hello,
>
>   I have some slides about Cassandra
> <https://docs.google.com/presentation/d/1JZYugL4WC9grgZswg1i6gAfWmFBqg9iQ0YcPLUiQ-6w/edit?usp=sharing>,
> feel free to borrow.
>
>  Best, Oliver
>
>
>
> On Fri, Feb 23, 2018 at 7:28 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> These nine web pages on the Apache Cassandra web site have blank To Do
> sections.  Most of the web pages are completely blank.  Mind you there is a
> lot of hard work already done on the documentation.  I’ll make JIRA’s for
> any of the blank sections where there is not

Re: Filling in the blank To Do sections on the Apache Cassandra web site

2018-02-27 Thread Carl Mueller
so... are those pages in the code tree of github? I don't see them or a
directory structure under /doc. Is mirroring the documentation between the
apache site and a github source a big issue?

On Tue, Feb 27, 2018 at 7:50 AM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> I was debating that.  Splitting it up into smaller tasks makes each one
> seem less over-whelming.
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Josh McKenzie [mailto:jmcken...@apache.org]
> *Sent:* Tuesday, February 27, 2018 5:44 AM
> *To:* cassandra
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
> Might help, organizationally, to put all these efforts under a single
> ticket of "Improve web site Documentation" and add these as sub-tasks.
> Should be able to do that translation post-creation (i.e. in its current
> state) if that's something that makes sense to you.
>
>
>
> On Mon, Feb 26, 2018 at 5:24 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> Here are the related JIRA’s.  Please add content even if It’s not well
> formed compositionally.  Myself or someone else will take it from there
>
>
>
> https://issues.apache.org/jira/browse/CASSANDRA-14274  The
> troubleshooting section of the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14273  The Bulk Loading
> web page on the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14272  The Backups web
> page on the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14271  The Hints web page
> in the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14270  The Read repair
> web page is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14269  The Data Modeling
> section of the web site is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14268  The
> Architecture:Guarantees web page is empty
>
> https://issues.apache.org/jira/browse/CASSANDRA-14267  The Dynamo web
> page on the Apache Cassandra site is missing content
>
> https://issues.apache.org/jira/browse/CASSANDRA-14266  The Architecture
> Overview web page on the Apache Cassandra site is empty
>
>
>
> Thanks for pitching in
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> *Sent:* Monday, February 26, 2018 1:54 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
> Nice!  Thanks for the help Oliver!
>
>
>
> Kenneth Brotman
>
>
>
> *From:* Oliver Ruebenacker [mailto:cur...@gmail.com]
> *Sent:* Sunday, February 25, 2018 7:12 AM
> *To:* user@cassandra.apache.org
> *Cc:* d...@cassandra.apache.org
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
>
>
>
>
>  Hello,
>
>   I have some slides about Cassandra
> ,
> feel free to borrow.
>
>  Best, Oliver
>
>
>
> On Fri, Feb 23, 2018 at 7:28 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
>
> These nine web pages on the Apache Cassandra web site have blank To Do
> sections.  Most of the web pages are completely blank.  Mind you there is a
> lot of hard work already done on the documentation.  I’ll make JIRA’s for
> any of the blank sections where there is not already a JIRA.  Then it will
> be on to writing up those sections.  *If you have any text to help me get
> started for any of these sections that would be really cool. *
>
>
>
> http://cassandra.apache.org/doc/latest/architecture/overview.html
>
>
>
> http://cassandra.apache.org/doc/latest/architecture/dynamo.html
>
>
>
> http://cassandra.apache.org/doc/latest/architecture/guarantees.html
> 
>
>
>
> http://cassandra.apache.org/doc/latest/data_modeling/index.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/read_repair.html
> 
>
>
>
> http://cassandra.apache.org/doc/latest/operating/hints.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/backups.html
>
>
>
> http://cassandra.apache.org/doc/latest/operating/bulk_loading.html
>
>
>
> http://cassandra.apache.org/doc/latest/troubleshooting/index.html
>
>
>
> Kenneth Brotman
>
>
>
>
>
>
> --
>
> Oliver Ruebenacker
>
> Senior Software Engineer, Diabetes Portal
> , Broad Institute
> 
>
>
>
>
>


Re: Version Rollback

2018-02-27 Thread Carl Mueller
My speculation is that IF (big if) the sstable formats are compatible
between the versions, which probably isn't the case for major versions,
then you could drop back.

If the sstables changed format, then you'll probably need to figure out how
to rewrite the sstables in the older format and then sstableloader them in
the older-version cluster if need be. Alas, while there is an sstable
upgrader, there isn't a downgrader AFAIK.

And I don't have an intimate view of version-by-version sstable format
changes and compatibilities. You'd probably need to check the upgrade
instructions (which you presumably did if you're upgrading versions) to
tell.

Basically, version rollback is pretty unlikely to be done.

The OTHER option:

1) build a new cluster with the new version, no new data.

2) code your driver interfaces to interface with both clusters. Write to
both, but read preferentially from the new, then fall through to the old.
Yes, that gets hairy on multiple row queries. Port your data with sstable
loading from the old to the new gradually.

When you've done a full load of all the data from old to new, and you're
satisfied with the new cluster stability, retire the old cluster.

For merging two multirow sets you'll probably need your multirow queries to
return the partition hash value (or extract the code that generates the
hash), and have two simultaneous java-driver ResultSets going, and merge
their results, providing the illusion of a single database query. You'll
need to pay attention to both the row key ordering and column key ordering
to ensure the combined results are properly ordered.

Writes will be slowed by the double-writes; reads will be bound by the
worse-performing cluster.
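
Roughly, the client-side pattern I mean looks like this (a sketch with the
Python driver; cluster addresses, keyspace, and table are placeholders, and
real code needs retries, paging, and the multi-row merge described above):

    from cassandra.cluster import Cluster

    OLD = Cluster(["old-cluster-node"]).connect("my_ks")
    NEW = Cluster(["new-cluster-node"]).connect("my_ks")

    INSERT = "INSERT INTO events (pk, ck, val) VALUES (%s, %s, %s)"
    SELECT = "SELECT ck, val FROM events WHERE pk = %s"

    def dual_write(pk, ck, val):
        # the new cluster is the eventual source of truth, but keep the old
        # one current so rollback is just "point reads back at OLD"
        NEW.execute(INSERT, (pk, ck, val))
        OLD.execute(INSERT, (pk, ck, val))

    def read(pk):
        # prefer the new cluster, fall through to the old one on a miss
        rows = list(NEW.execute(SELECT, (pk,)))
        return rows if rows else list(OLD.execute(SELECT, (pk,)))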

On Tue, Feb 27, 2018 at 8:23 AM, Kenneth Brotman <
kenbrot...@yahoo.com.invalid> wrote:

> Could you tell us the size and configuration of your Cassandra cluster?
>
>
>
> Kenneth Brotman
>
>
>
> *From:* shalom sagges [mailto:shalomsag...@gmail.com]
> *Sent:* Tuesday, February 27, 2018 6:19 AM
> *To:* user@cassandra.apache.org
> *Subject:* Version Rollback
>
>
>
> Hi All,
>
> I'm planning to upgrade my C* cluster to version 3.x and was wondering
> what's the best way to perform a rollback if need be.
>
> If I used snapshot restoration, I would be facing data loss, depends when
> I took the snapshot (i.e. a rollback might be required after upgrading half
> the cluster for example).
>
> If I add another DC to the cluster with the old version, then I could
> point the apps to talk to that DC if anything bad happens, but building it
> is really time consuming and requires a lot of resources.
>
> Can anyone provide recommendations on this matter? Any ideas on how to
> make the upgrade foolproof, or at least "really really safe"?
>
>
>
> Thanks!
>
>
>


Re: Filling in the blank To Do sections on the Apache Cassandra web site

2018-02-27 Thread Carl Mueller
Nice thanks


On Tue, Feb 27, 2018 at 12:03 PM, Jon Haddad <j...@jonhaddad.com> wrote:

> There’s a section dedicated to contributing to Cassandra documentation in
> the docs as well: https://cassandra.apache.org/doc/latest/
> development/documentation.html
>
>
>
> On Feb 27, 2018, at 9:55 AM, Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
> wrote:
>
> I was just getting ready to install sphinx.  Cool.
>
> *From:* Jon Haddad [mailto:jonathan.had...@gmail.com
> <jonathan.had...@gmail.com>] *On Behalf Of *Jon Haddad
> *Sent:* Tuesday, February 27, 2018 9:51 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
> The docs have been in tree for years :)
>
> https://github.com/apache/cassandra/tree/trunk/doc
>
> There’s even a docker image to build them so you don’t need to mess with
> sphinx.  Check the README for instructions.
>
> Jon
>
>
> On Feb 27, 2018, at 9:49 AM, Carl Mueller <carl.muel...@smartthings.com>
> wrote:
>
>
> If there was a github for the docs, we could start posting content to it
> for review. Not sure what the review/contribution process is on Apache.
> Google searches on apache documentation and similar run into lots of noise
> from actual projects.
>
> I wouldn't mind trying to do a little doc work on the regular if there was
> a wiki, a proven means to do collaborative docs.
>
> On Tue, Feb 27, 2018 at 11:42 AM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
> It’s just content for web pages.  There isn’t a working outline or any
> draft on any of the JIRA’s yet.  I like to keep things simple.  Did I miss
> something?  What does it matter right now?
>
> Thanks Carl,
>
> Kenneth Brotman
>
> *From:* Carl Mueller [mailto:carl.muel...@smartthings.com]
> *Sent:* Tuesday, February 27, 2018 8:50 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
> so... are those pages in the code tree of github? I don't see them or a
> directory structure under /doc. Is mirroring the documentation between the
> apache site and a github source a big issue?
>
> On Tue, Feb 27, 2018 at 7:50 AM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
> I was debating that.  Splitting it up into smaller tasks makes each one
> seem less over-whelming.
>
> Kenneth Brotman
>
> *From:* Josh McKenzie [mailto:jmcken...@apache.org]
> *Sent:* Tuesday, February 27, 2018 5:44 AM
> *To:* cassandra
> *Subject:* Re: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
> Might help, organizationally, to put all these efforts under a single
> ticket of "Improve web site Documentation" and add these as sub-tasks
> Should be able to do that translation post-creation (i.e. in its current
> state) if that's something that makes sense to you.
>
> On Mon, Feb 26, 2018 at 5:24 PM, Kenneth Brotman <
> kenbrot...@yahoo.com.invalid> wrote:
> Here are the related JIRA’s.  Please add content even if It’s not well
> formed compositionally.  Myself or someone else will take it from there
>
> https://issues.apache.org/jira/browse/CASSANDRA-14274  The
> troubleshooting section of the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14273  The Bulk Loading
> web page on the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14272  The Backups web
> page on the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14271  The Hints web page
> in the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14270  The Read repair
> web page is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14269  The Data
> Modeling section of the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14268  The
> Architecture:Guarantees web page is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14267  The Dynamo web
> page on the Apache Cassandra site is missing content
> https://issues.apache.org/jira/browse/CASSANDRA-14266  The Architecture
> Overview web page on the Apache Cassandra site is empty
>
> Thanks for pitching in
>
> Kenneth Brotman
>
> *From:* Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> *Sent:* Monday, February 26, 2018 1:54 PM
> *To:* user@cassandra.apache.org
> *Subject:* RE: Filling in the blank To Do sections on the Apache
> Cassandra web site
>
> Nice!  Thanks for the help Oliver!
>
> Kenneth Brotman
>
> *From:* 

Re: Cassandra at Instagram with Dikang Gu interview by Jeff Carpenter

2018-03-12 Thread Carl Mueller
Again, I'd really like to get a feel for scylla vs rocksandra vs cassandra.

Isn't the driver binary protocol the easiest, least-redesign level at which
to swap storage engines? Scylla, Cassandra, and Rocksandra are currently the
three options. Rocksandra can expand its non-java footprint without
rearchitecting the java codebase. Or are there serious concerns with
Datastax and the binary protocols?

On Tue, Mar 6, 2018 at 12:42 PM, Goutham reddy 
wrote:

> It’s an interesting conversation. For more details about the pluggable
> storage engine here is the link.
>
> Blog:
> https://thenewstack.io/instagram-supercharges-cassandra-pluggable-rocksdb-
> storage-engine/
>
> JIRA:
> https://issues.apache.org/jira/plugins/servlet/mobile#
> issue/CASSANDRA-13475
>
>
> On Tue, Mar 6, 2018 at 9:01 AM Kenneth Brotman
>  wrote:
>
>> Just released on DataStax Distributed Data Show, DiKang Gu of Instagram
>> interviewed by author Jeff Carpenter.
>>
>> Found it really interesting:  Shadow clustering, migrating from 2.2 to
>> 3.0, using the Rocks DB as a pluggable storage engine for Cassandra
>>
>> https://academy.datastax.com/content/distributed-data-show-
>> episode-37-cassandra-instagram-dikang-gu
>>
>>
>>
>> Kenneth Brotman
>>
> --
> Regards
> Goutham Reddy
>


Re: Cassandra vs MySQL

2018-03-14 Thread Carl Mueller
THERE ARE NO JOINS WITH CASSANDRA

CQL != SQL

Same for aggregation, subqueries, etc. And effectively multitable
transactions are out.

If you have simple single-table queries and updates, or can convert the app
to do so, then you're in business.
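
To illustrate what "convert the app" means in practice, here is a toy sketch
of query-first modeling with the Python driver (keyspace, table, and columns
are invented): the former JOIN becomes its own table plus a single-partition
read.

    import uuid
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("shop")

    session.execute("""
        CREATE TABLE IF NOT EXISTS orders_by_customer (
            customer_id uuid,
            order_time  timestamp,
            order_id    uuid,
            total       decimal,
            PRIMARY KEY (customer_id, order_time, order_id)
        ) WITH CLUSTERING ORDER BY (order_time DESC, order_id ASC)
    """)

    # the app writes this table alongside the base orders table at insert
    # time; what would have been a JOIN is then one single-partition read:
    customer_id = uuid.uuid4()
    rows = session.execute(
        "SELECT order_id, total FROM orders_by_customer "
        "WHERE customer_id = %s LIMIT 20", (customer_id,))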

On Tue, Mar 13, 2018 at 5:02 AM, Rahul Singh 
wrote:

> Oliver,
>
>
> Here’s the criteria I have for you:
>
> 1. Do you need massive concurrency on reads and writes ?
>
> If not you can replicate MySQL using master slave. Or consider Galera -
> Maria DB master master. I’ve not used it but then again doesn’t mean that
> it doesn’t work. If you have time to experiment , please do a comparison
> with Galera vs. Cassandra. ;)
>
> 2. Do you plan on doing both OLTP and OLAP on the same data?
>
> Cassandra can replicate data to different Datacenters so you can
> concurrently do heavy read and write on one Logical Datacenter and
> simultaneously have another Logical Datacenter for analytics.
>
> 3. Do you have a ridiculously strict SLA to maintain? And does it need to
> be global?
>
> If you don’t need to be up and running all the time and don’t need a
> global platform, don’t bother using Cassandra.
>
> Exporting a relational schema and importing into Cassandra will be a box
> of hurt. In my professional (the type of experience that comes from people
> paying me to make judgments, decisions ) experience with Cassandra, the
> biggest mistake is people thinking that since CQL is similar to SQL that it
> is just like SQL. It’s not. The keys and literally “no relationships” mean
> that all the tables should be “Report tables” or “direct object tables.”
> That being said if you don’t do a lot of joins and arbitrary selects on any
> field, Cassandra can help achieve massive scale.
>
> The statement that “Cassandra is going to die in a few time” is the same
> thing people said about Java and .NET. They are still here decades later.
> Cassandra has achieved critical mass. So much that a company made a C++
> version of it and Microsoft supports a global Database as a service version
> of it called Cosmos, not to mention that DataStax supports huge global
> brands on a commercial build of it. It’s not going anywhere.
>
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Mar 12, 2018, 3:58 PM -0400, Oliver Ruebenacker ,
> wrote:
>
>
>  Hello,
>
>   We have a project currently using MySQL single-node with 5-6TB of data
> and some performance issues, and we plan to add data up to a total size of
> maybe 25-30TB.
>
>   We are thinking of migrating to Cassandra. I have been trying to find
> benchmarks or other guidelines to compare MySQL and Cassandra, but most of
> them seem to be five years old or older.
>
>   Is there some good more recent material?
>
>   Thanks!
>
>  Best, Oliver
>
> --
> Oliver Ruebenacker
> Senior Software Engineer, Diabetes Portal
> , Broad Institute
> 
>
>


Re: Rocksandra blog post

2018-03-06 Thread Carl Mueller
Basically they are avoiding GC, right? Not necessarily improving on the
theoretical behavior of sstables and LSM trees.

Why didn't they use/try scylla? I'd be interested to see that benchmark.

On Tue, Mar 6, 2018 at 3:48 AM, Romain Hardouin  wrote:

> Rocksandra is very interesting for key/value data model. Let's hope it
> will land in C* upstream in the near future thanks to pluggable storage.
> Thanks Dikang!
>
>
>
> Le mardi 6 mars 2018 à 10:06:16 UTC+1, Kyrylo Lebediev <
> kyrylo_lebed...@epam.com> a écrit :
>
>
> Thanks for sharing, Dikang!
>
> Impressive results.
>
>
> As you plugged in different storage engine, it's interesting how you're
> dealing with compactions in Rocksandra?
>
> Is there still the concept of immutable SSTables + compaction strategies
> or it was changed somehow?
>
>
> Best,
>
> Kyrill
>
> --
> *From:* Dikang Gu 
> *Sent:* Monday, March 5, 2018 8:26 PM
> *To:* d...@cassandra.apache.org; cassandra
> *Subject:* Rocksandra blog post
>
> As some of you already know, Instagram Cassandra team is working on the
> project to use RocksDB as Cassandra's storage engine.
>
> Today, we just published a blog post about the work we have done, and more
> excitingly, we published the benchmark metrics in AWS environment.
>
> Check it out here:
> https://engineering.instagram.com/open-sourcing-a-10x-
> reduction-in-apache-cassandra-tail-latency-d64f86b43589
>
> Thanks
> Dikang
>
>


Re: data types storage saving

2018-03-06 Thread Carl Mueller
If you're willing to do the data type conversion on insert and retrieval,
then you could use blobs as a sort of "adaptive-length int", AFAIK.
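
Something like this pair of helpers is what I have in mind (a sketch;
non-negative values only, and the Python driver maps bytes to a CQL blob on
write and back to bytes on read):

    def int_to_blob(n: int) -> bytes:
        # 0 still needs one byte; +7 rounds the bit length up to whole bytes
        return n.to_bytes(max(1, (n.bit_length() + 7) // 8), "big")

    def blob_to_int(b: bytes) -> int:
        return int.from_bytes(b, "big")

    assert blob_to_int(int_to_blob(42)) == 42     # 1 byte instead of 4
    assert len(int_to_blob(65000)) == 2           # the common sub-65K case
    assert len(int_to_blob(2_000_000_000)) == 4   # rare big values still fit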

On Tue, Mar 6, 2018 at 6:02 AM, onmstester onmstester 
wrote:

> I'm using int data type for one of my columns but for 99.99...% its data
> never would be > 65K, Should i change it to smallint (It would save some
> Gigabytes disks in a few months) or Cassandra Compression would take care
> of it in storage?
> What about blob data type ? Isn't  better to use it in such cases? could i
> alter column type from smallInt to int in future if needed?
>
> Sent using Zoho Mail 
>
>
>


Re: [EXTERNAL] Cassandra vs MySQL

2018-03-20 Thread Carl Mueller
Yes, cassandra's big win is that once you get your data and applications
adapted to the platform, you have a clear path to very very large scale and
resiliency. Um, assuming you have the dollars. It scales out on commodity
hardware, but isn't exactly efficient in the use of that hardware. I like
to say that Cassandra makes big data "bigger data" because of the
timestamp-per-cell and column name overhead and replication factor.

On Tue, Mar 20, 2018 at 2:54 PM, Jeff Jirsa  wrote:

> I suspect you're approaching this problem from the wrong side.
>
> The decision of MySQL vs Cassandra isn't usually about performance, it's
> about the other features that may impact/enable that performance.
>
> - Will you have a data set that won't fit on any single MySQL Server?
> - Will you want to write into two different hot datacenters at the same
> time?
> - Do you want to be able to restart any single server without impacting
> the cluster?
>
> If you answer yes to those, then cassandra has an option to do so
> trivially, where you'd have to build tooling with MySQL.
>
> - Do you want to do arbitrary text searches?
> - Do you need JOINs?
> - Do you want to build indices on a lot of the columns and do ad-hoc
> querying?
>
> If you answer yes to those, they're far easier in MySQL than Cassandra.
>
> If you're just looking for "Cassandra can do X writes per second and MySQL
> can do Y writes per second", those types of benchmarks are rarely relevant,
> because in both cases they tend to require expert tuning to get the full
> potential (and very few people are experts in both) and data dependent (and
> your data probably doesn't match the benchmarker's dataset).
>
> If I had a dataset that was ~10-20gb and wanted to do arbitrary reads on
> the data, I'd choose MySQL unless I absolutely positively could not
> tolerate downtime, in which case I'd go with Cassandra spanning multiple
> datacenters. If I had a dataset that was 200TB, or 200PB, I'd choose
> Cassandra, even if I could theoretically make MySQL do it faster, because
> the extra effort in building the tooling to manage that many shards of
> MySQL would be prohibitive to most organizations.
>
>
>
>
>
>
>
> On Tue, Mar 20, 2018 at 11:44 AM, Oliver Ruebenacker 
> wrote:
>
>>
>>  Hello,
>>
>>   Thanks for all the responses.
>>
>>   I do know some SQL and CQL, so I know the main differences. You can do
>> joins in MySQL, but the bigger your data, the less likely you want to do
>> that.
>>
>>   If you are a team that wants to consider migrating from MySQL to
>> Cassandra, you need some reason to believe that it is going to be faster.
>> What evidence is there?
>>
>>   Even the Cassandra home page has references to benchmarks to make the
>> case for Cassandra. Unfortunately, they seem to be about five to six years
>> old. It doesn't make sense to keep them there if you just can't compare.
>>
>>  Best, Oliver
>>
>> On Tue, Mar 20, 2018 at 1:13 PM, Durity, Sean R <
>> sean_r_dur...@homedepot.com> wrote:
>>
>>> I’m not sure there is a fair comparison. MySQL and Cassandra have
>>> different ways of solving related (but not necessarily the same) problems
>>> of storing and retrieving data.
>>>
>>>
>>>
>>> The data model between MySQL and Cassandra is likely to be very
>>> different. The key for Cassandra is that you need to model for the queries
>>> that will be executed. If you cannot know the queries ahead of time,
>>> Cassandra is not the best choice. If table scans are typically required,
>>> Cassandra is not a good choice. If you need more than a few hundred tables
>>> in a cluster, Cassandra is not a good choice.
>>>
>>>
>>>
>>> If multi-datacenter replication is required, Cassandra is an awesome
>>> choice. If you are going to always query by a partition key (or primary
>>> key), Cassandra is a great choice. The nice thing is that the performance
>>> scales linearly, so additional data is fine (as long as you add nodes) –
>>> again, if your data model is designed for Cassandra. If you like
>>> no-downtime upgrades and extreme reliability and availability, Cassandra is
>>> a great choice.
>>>
>>>
>>>
>>> Personally, I hope to never have to use/support MySQL again, and I love
>>> working with Cassandra. But, Cassandra is not the choice for all data
>>> problems.
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Oliver Ruebenacker [mailto:cur...@gmail.com]
>>> *Sent:* Monday, March 12, 2018 3:58 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* [EXTERNAL] Cassandra vs MySQL
>>>
>>>
>>>
>>>
>>>
>>>  Hello,
>>>
>>>   We have a project currently using MySQL single-node with 5-6TB of data
>>> and some performance issues, and we plan to add data up to a total size of
>>> maybe 25-30TB.
>>>
>>>   We are thinking of migrating to Cassandra. I have been trying to find
>>> benchmarks or other guidelines to compare MySQL and Cassandra, but most of
>>> them seem to be five years old or older.
>>>
>>>   Is there some good 

Re: One time major deletion/purge vs periodic deletion

2018-03-20 Thread Carl Mueller
It's possible you'll run into compaction headaches. Likely actually.

If you have time-bucketed purge/archives, I'd implement a time bucketing
strategy using rotating tables dedicated to a time period so that when an
entire table is ready for archiving you just snapshot its sstables and then
TRUNCATE/nuke the time bucket table.

Queries that span buckets and calculating the table to target on inserts
are a major pain in the ass, but at scale you'll probably want to consider
doing something like this.
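
A tiny sketch of the bucketing idea (the table naming scheme and schema are
invented):

    from datetime import datetime, timezone

    def bucket_table(ts: datetime) -> str:
        # one table per calendar month
        return "events_%04d%02d" % (ts.year, ts.month)

    def insert_cql(ts: datetime) -> str:
        return "INSERT INTO %s (pk, ck, val) VALUES (?, ?, ?)" % bucket_table(ts)

    print(bucket_table(datetime(2018, 3, 20, tzinfo=timezone.utc)))  # events_201803
    # retiring a bucket once it ages out: snapshot its sstables for the
    # archive, then `TRUNCATE events_201801` -- no per-row tombstones,
    # no compaction debt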

On Wed, Mar 7, 2018 at 8:19 PM, kurt greaves  wrote:

> The important point to consider is whether you are deleting old data or
> recently written data. How old/recent depends on your write rate to the
> cluster and there's no real formula. Basically you want to avoid deleting a
> lot of old data all at once because the tombstones will end up in new
> SSTables and the data to be deleted will live in higher levels (LCS) or
> large SSTables (STCS), which won't get compacted together for a long time.
> In this case it makes no difference if you do a big purge or if you break
> it up, because at the end of the day if your big purge is just old data,
> all the tombstones will have to stick around for awhile until they make it
> to the higher levels/bigger SSTables.
>
> If you have to purge large amounts of old data, the easiest way is to 1.
> Make sure you have at least 50% disk free (for large/major compactions)
> and/or 2. Use garbagecollect compactions (3.10+)
> ​
>


cassl 2.1.x seed node update via JMX

2018-03-22 Thread Carl Mueller
We have a cluster that is subject to the one-year gossip bug.

We'd like to update the seed node list via JMX without a restart, since the
single seed node we foolishly configured in this forsaken cluster is being
auto-culled in AWS.

Is this possible? It is not marked volatile in the Config of the source
code, so I doubt it.


Re: cassl 2.1.x seed node update via JMX

2018-03-22 Thread Carl Mueller
Thanks. The rolling restart triggers the gossip bug, so that's a no-go.
We're going to migrate off the cluster. Thanks!



On Thu, Mar 22, 2018 at 5:04 PM, Nate McCall <n...@thelastpickle.com> wrote:

> This capability was *just* added in CASSANDRA-14190 and only in trunk.
>
> Previously (as described in the ticket above), the seed node list is only
> updated when doing a shadow round, removing an endpoint or restarting (look
> for callers of o.a.c.gms.Gossiper#buildSeedsList() if you're curious).
>
> A rolling restart is the usual SOP for that.
>
> On Fri, Mar 23, 2018 at 9:54 AM, Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> We have a cluster that is subject to the one-year gossip bug.
>>
>> We'd like to update the seed node list via JMX without restart, since our
>> foolishly single-seed-node in this forsaken cluster is being autoculled in
>> AWS.
>>
>> Is this possible? It is not marked volatile in the Config of the source
>> code, so I doubt it.
>>
>
>
>
> --
> -
> Nate McCall
> Wellington, NZ
> @zznate
>
> CTO
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-16 Thread Carl Mueller
Your dashboards are great. The only challenge is getting all the data to
feed them.


On Tue, Oct 16, 2018 at 1:45 PM Carl Mueller 
wrote:

> metadata.csv: that helps a lot, thank you!
>
> On Fri, Oct 5, 2018 at 5:42 AM Alain RODRIGUEZ  wrote:
>
>> I feel you for most of the troubles you faced, I've been facing most of
>> them too. Again, Datadog support can probably help you with most of those.
>> You should really consider sharing this feedback to them.
>>
>> there is re-namespacing of the metric names in lots of cases, and these
>>> don't appear to be centrally documented, but maybe i haven't found the
>>> magic page.
>>>
>>
>> I don't know if that would be the 'magic' page, but that's something:
>> https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv
>>
>> There are so many good stats.
>>
>>
>> Yes, and it's still improving. I love this about Cassandra. It's our work
>> to pick the relevant ones for each situation. I would not like Cassandra to
>> reduce the number of metrics exposed, we need to learn to handle them
>> properly. Also, this is the reason we designed 4 dashboards out the box,
>> the goal was to have everything we need for distinct scenarios:
>> - Overview - global health-check / anomaly detection
>> - Read Path - troubleshooting / optimizing read ops
>> - Write Path - troubleshooting / optimizing write ops
>> - SSTable Management - troubleshooting / optimizing -
>> comapction/flushes/... anything related to sstables.
>>
>> instead of the single overview dashboard that was present before. We are
>> also perfectly aware that it's far from perfect, but aiming at perfect
>> would only have had us never releasing anything. Anyone interested could
>> now build missing dashboards or improve existing ones for himself or/and
>> suggest improvements to Datadog :). I hope I'll do some more of this work
>> at some point in the future.
>>
>> Good luck,
>> C*heers,
>> ---
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> Le jeu. 4 oct. 2018 à 21:21, Carl Mueller
>>  a écrit :
>>
>>> for 2.1.x we had a custom reporter that delivered  metrics to datadog's
>>> endpoint via https, bypassing the agent-imposed 350. But integrating that
>>> required targetting the other shared libs in the cassandra path, so the
>>> build is a bit of a pain when we update major versions.
>>>
>>> We are migrating our 2.1.x specific dashboards, and we will use
>>> agent-delivered metrics for non-table, and adapt the custom library to
>>> deliver the table-based ones, at a slower rate than the "core" ones.
>>>
>>> Datadog is also super annoying because there doesn't appear to be
>>> anything that reports what metrics the agent is sending (the metric count
>>> can indicate if a configured new metric increased the count and is being
>>> reported, but it's still... a guess), and there is re-namespacing of the
>>> metric names in lots of cases, and these don't appear to be centrally
>>> documented, but maybe i haven't found the magic page.
>>>
>>> There are so many good stats. We might also implement some facility
>>> to dynamically turn on the delivery of detailed metrics on the nodes.
>>>
>>> On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ 
>>> wrote:
>>>
>>>> Hello Carl,
>>>>
>>>> I guess we can use bean_regex to do specific targetted metrics for the
>>>>> important tables anyway.
>>>>>
>>>>
>>>> Yes, this would work, but 350 is very limited for Cassandra dashboards.
>>>> We have a LOT of metrics available.
>>>>
>>>> Datadog 350 metric limit is a PITA for tables once you get over 10
>>>>> tables
>>>>>
>>>>
>>>> I noticed this while I was working on providing default dashboards for
>>>> Cassandra-Datadog integration. I was told by Datadog team it would not be
>>>> an issue for users, that I should not care about it. As you pointed out,
>>>> per table metrics quickly increase the total number of metrics we need to
>>>> collect.
>>>>
>>>> I believe you can set the following option: *"max_returned_metrics:
>>>> 1000"* - it can be used if metrics are missing to increase the limit
>>>> of the num

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-16 Thread Carl Mueller
metadata.csv: that helps a lot, thank you!

On Fri, Oct 5, 2018 at 5:42 AM Alain RODRIGUEZ  wrote:

> I feel you for most of the troubles you faced, I've been facing most of
> them too. Again, Datadog support can probably help you with most of those.
> You should really consider sharing this feedback to them.
>
> there is re-namespacing of the metric names in lots of cases, and these
>> don't appear to be centrally documented, but maybe i haven't found the
>> magic page.
>>
>
> I don't know if that would be the 'magic' page, but that's something:
> https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv
>
> There are so many good stats.
>
>
> Yes, and it's still improving. I love this about Cassandra. It's our work
> to pick the relevant ones for each situation. I would not like Cassandra to
> reduce the number of metrics exposed, we need to learn to handle them
> properly. Also, this is the reason we designed 4 dashboards out the box,
> the goal was to have everything we need for distinct scenarios:
> - Overview - global health-check / anomaly detection
> - Read Path - troubleshooting / optimizing read ops
> - Write Path - troubleshooting / optimizing write ops
> - SSTable Management - troubleshooting / optimizing -
> comapction/flushes/... anything related to sstables.
>
> instead of the single overview dashboard that was present before. We are
> also perfectly aware that it's far from perfect, but aiming at perfect
> would only have had us never releasing anything. Anyone interested could
> now build missing dashboards or improve existing ones for himself or/and
> suggest improvements to Datadog :). I hope I'll do some more of this work
> at some point in the future.
>
> Good luck,
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le jeu. 4 oct. 2018 à 21:21, Carl Mueller
>  a écrit :
>
>> for 2.1.x we had a custom reporter that delivered  metrics to datadog's
>> endpoint via https, bypassing the agent-imposed 350. But integrating that
>> required targetting the other shared libs in the cassandra path, so the
>> build is a bit of a pain when we update major versions.
>>
>> We are migrating our 2.1.x specific dashboards, and we will use
>> agent-delivered metrics for non-table, and adapt the custom library to
>> deliver the table-based ones, at a slower rate than the "core" ones.
>>
>> Datadog is also super annoying because there doesn't appear to be
>> anything that reports what metrics the agent is sending (the metric count
>> can indicate if a configured new metric increased the count and is being
>> reported, but it's still... a guess), and there is re-namespacing of the
>> metric names in lots of cases, and these don't appear to be centrally
>> documented, but maybe i haven't found the magic page.
>>
>> There are so many good stats. We might also implement some facility
>> to dynamically turn on the delivery of detailed metrics on the nodes.
>>
>> On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ 
>> wrote:
>>
>>> Hello Carl,
>>>
>>> I guess we can use bean_regex to do specific targetted metrics for the
>>>> important tables anyway.
>>>>
>>>
>>> Yes, this would work, but 350 is very limited for Cassandra dashboards.
>>> We have a LOT of metrics available.
>>>
>>> Datadog 350 metric limit is a PITA for tables once you get over 10 tables
>>>>
>>>
>>> I noticed this while I was working on providing default dashboards for
>>> Cassandra-Datadog integration. I was told by Datadog team it would not be
>>> an issue for users, that I should not care about it. As you pointed out,
>>> per table metrics quickly increase the total number of metrics we need to
>>> collect.
>>>
>>> I believe you can set the following option: *"max_returned_metrics:
>>> 1000"* - it can be used if metrics are missing to increase the limit of
>>> the number of collected metrics. Be aware of CPU utilization that this
>>> might imply (greatly improved in dd-agent version 6+ I believe -thanks
>>> Datadog teams for that- making this fully usable for Cassandra). This
>>> option should go in the *cassandra.yaml* file for Cassandra
>>> integrations, off the top of my head.
>>>
>>> Also, do not hesitate to reach to Datadog directly for this kind of
>>> questions, I have always been very happy with their support so far, I 

Re: rolling version upgrade, upgradesstables, and vulnerability window

2018-10-30 Thread Carl Mueller
But the topology-change restrictions are only in place while there are
heterogeneous versions in the cluster? Having all the nodes on the upgraded
version but with "degraded" (not yet upgraded) sstables does NOT preclude
topology changes or node replacement/addition?


On Tue, Oct 30, 2018 at 10:33 AM Jeff Jirsa  wrote:

> Wait for 3.11.4 to be cut
>
> I also vote for doing all the binary bounces and upgradesstables after the
> fact, largely because normal writes/compactions are going to naturally
> start upgrading sstables anyway, and there are some hard restrictions on
> mixed mode (e.g. schema changes won’t cross version) that can be far more
> impactful.
>
>
>
> --
> Jeff Jirsa
>
>
> > On Oct 30, 2018, at 8:21 AM, Carl Mueller 
> > 
> wrote:
> >
> > We are about to finally embark on some version upgrades for lots of
> clusters, 2.1.x and 2.2.x targetting eventually 3.11.x
> >
> > I have seen recipes that do the full binary upgrade + upgrade sstables
> for 1 node before moving forward, while I've seen a 2016 vote by Jon Haddad
> (a TLP guy) that backs doing the binary version upgrades through the
> cluster on a rolling basis, then doing the upgradesstables on a rolling
> basis.
> >
> > Under what cluster conditions are streaming/node replacement precluded,
> that is we are vulnerable to a cloud provided dumping one of our nodes
> under us or hardware failure? We ain't apple, but we do have 30+ node
> datacenters and 80-100 node clusters.
> >
> > Is the node replacement and streaming only disabled while there are
> heterogenous cassandra versions, or until all the sstables have been
> upgraded in the cluster?
> >
> > My instincts tell me the best thing to do is to get all the cassandra
> nodes to the same version without the upgradesstables step through the
> cluster, and then roll through the upgradesstables as needed, and that
> upgradesstables is a node-local concern that doesn't impact streaming or
> node replacement or other situations since cassandra can read old version
> sstables and new sstables would simply be the new format.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: rolling version upgrade, upgradesstables, and vulnerability window

2018-10-30 Thread Carl Mueller
Thank you very much. I couldn't find any definitive answer on that on the
list or stackoverflow.

It's clear that the safest for a prod cluster is rolling version upgrade of
the binary, then the upgradesstables.

I will strongly consider cstar for the upgradesstables


On Tue, Oct 30, 2018 at 10:39 AM Alexander Dejanovski <
a...@thelastpickle.com> wrote:

> Yes, as the new version can read both the old and the new sstables format.
>
> Restrictions only apply when the cluster is in mixed versions.
>
> On Tue, Oct 30, 2018 at 4:37 PM Carl Mueller
>  wrote:
>
>> But the topology change restrictions are only in place while there are
>> heterogenous versions in the cluster? All the nodes at the upgraded version
>> with "degraded" sstables does NOT preclude topology changes or node
>> replacement/addition?
>>
>>
>> On Tue, Oct 30, 2018 at 10:33 AM Jeff Jirsa  wrote:
>>
>>> Wait for 3.11.4 to be cut
>>>
>>> I also vote for doing all the binary bounces and upgradesstables after
>>> the fact, largely because normal writes/compactions are going to naturally
>>> start upgrading sstables anyway, and there are some hard restrictions on
>>> mixed mode (e.g. schema changes won’t cross version) that can be far more
>>> impactful.
>>>
>>>
>>>
>>> --
>>> Jeff Jirsa
>>>
>>>
>>> > On Oct 30, 2018, at 8:21 AM, Carl Mueller <
>>> carl.muel...@smartthings.com.INVALID> wrote:
>>> >
>>> > We are about to finally embark on some version upgrades for lots of
>>> clusters, 2.1.x and 2.2.x targetting eventually 3.11.x
>>> >
>>> > I have seen recipes that do the full binary upgrade + upgrade sstables
>>> for 1 node before moving forward, while I've seen a 2016 vote by Jon Haddad
>>> (a TLP guy) that backs doing the binary version upgrades through the
>>> cluster on a rolling basis, then doing the upgradesstables on a rolling
>>> basis.
>>> >
>>> > Under what cluster conditions are streaming/node replacement
>>> precluded, that is we are vulnerable to a cloud provided dumping one of our
>>> nodes under us or hardware failure? We ain't apple, but we do have 30+ node
>>> datacenters and 80-100 node clusters.
>>> >
>>> > Is the node replacement and streaming only disabled while there are
>>> heterogenous cassandra versions, or until all the sstables have been
>>> upgraded in the cluster?
>>> >
>>> > My instincts tell me the best thing to do is to get all the cassandra
>>> nodes to the same version without the upgradesstables step through the
>>> cluster, and then roll through the upgradesstables as needed, and that
>>> upgradesstables is a node-local concern that doesn't impact streaming or
>>> node replacement or other situations since cassandra can read old version
>>> sstables and new sstables would simply be the new format.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


rolling version upgrade, upgradesstables, and vulnerability window

2018-10-30 Thread Carl Mueller
We are about to finally embark on some version upgrades for lots of
clusters, 2.1.x and 2.2.x, targeting eventually 3.11.x.

I have seen recipes that do the full binary upgrade + upgrade sstables for
1 node before moving forward, while I've seen a 2016 vote by Jon Haddad (a
TLP guy) that backs doing the binary version upgrades through the cluster
on a rolling basis, then doing the upgradesstables on a rolling basis.

Under what cluster conditions are streaming/node replacement precluded,
that is, when are we vulnerable to a cloud provider dumping one of our nodes
out from under us, or to hardware failure? We ain't apple, but we do have
30+ node datacenters and 80-100 node clusters.

Are node replacement and streaming only disabled while there are
heterogeneous cassandra versions in the cluster, or until all the sstables
have been upgraded?

My instincts tell me the best thing to do is to get all the cassandra nodes
to the same version without the upgradesstables step, then roll through
upgradesstables as needed. upgradesstables seems to be a node-local concern
that doesn't impact streaming, node replacement, or other operations, since
cassandra can read old-version sstables and new sstables would simply be
written in the new format.
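
In script form, the order I mean is roughly this (a sketch only; hostnames,
service name, and package commands are placeholders for whatever your
tooling actually does):

    import subprocess

    NODES = ["cass-node-1", "cass-node-2", "cass-node-3"]

    def ssh(host, *cmd):
        subprocess.run(["ssh", host] + list(cmd), check=True)

    # pass 1: rolling binary bounce, one node at a time
    for host in NODES:
        ssh(host, "nodetool", "drain")
        ssh(host, "sudo", "service", "cassandra", "stop")
        ssh(host, "sudo", "apt-get", "install", "-y", "cassandra")  # new version
        ssh(host, "sudo", "service", "cassandra", "start")
        # wait for Up/Normal in `nodetool status` before the next node (not shown)

    # pass 2: only once every node is running the new binary
    for host in NODES:
        ssh(host, "nodetool", "upgradesstables")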


comprehensive list of checks before rolling version upgrades

2018-10-30 Thread Carl Mueller
Does anyone have a pretty comprehensive list of these? Many of these I don't
currently know how to check yet, but I'm researching... (a rough scripted
sketch of some of them follows the list below)

I've seen:

- verify disk space available for snapshot + sstable rewrite
- gossip state agreement, all nodes are healthy
- schema state agreement
- ability to access all the nodes
- no repairs, upgradesstables, or cleanups underway
- read repair/hinted handoff is not backed up

Other possibles:
- repair state? can we get away with unrepaired data?
- pending tasks?
- streaming state/tasks?
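
Here's the rough scripted sketch mentioned above, just shelling out to
nodetool on each node; it only prints output for a human to eyeball, and
thresholds/parsing are left out:

    import subprocess

    CHECKS = [
        ["nodetool", "describecluster"],   # all schema versions should match
        ["nodetool", "status"],            # every node Up/Normal
        ["nodetool", "compactionstats"],   # no upgradesstables/cleanup running
        ["nodetool", "netstats"],          # no active streams, small hint backlog
        ["nodetool", "tpstats"],           # pending/blocked thread-pool tasks
        ["df", "-h"],                      # room for snapshot + sstable rewrite
    ]

    for cmd in CHECKS:
        print("==== " + " ".join(cmd))
        out = subprocess.run(cmd, capture_output=True, text=True)
        print(out.stdout or out.stderr)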


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-04 Thread Carl Mueller
For 2.1.x we had a custom reporter that delivered metrics to datadog's
endpoint via https, bypassing the agent-imposed 350-metric limit. But
integrating that required targeting the other shared libs in the cassandra
path, so the build is a bit of a pain when we update major versions.

We are migrating our 2.1.x specific dashboards, and we will use
agent-delivered metrics for non-table, and adapt the custom library to
deliver the table-based ones, at a slower rate than the "core" ones.

Datadog is also super annoying because there doesn't appear to be anything
that reports which metrics the agent is sending (the metric count can
indicate if a configured new metric increased the count and is being
reported, but it's still... a guess), and there is re-namespacing of the
metric names in lots of cases, and these don't appear to be centrally
documented, but maybe I haven't found the magic page.

There are so many good stats. We might also implement some facility to
dynamically turn on the delivery of detailed metrics on the nodes.

On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ  wrote:

> Hello Carl,
>
> I guess we can use bean_regex to do specific targetted metrics for the
>> important tables anyway.
>>
>
> Yes, this would work, but 350 is very limited for Cassandra dashboards. We
> have a LOT of metrics available.
>
> Datadog 350 metric limit is a PITA for tables once you get over 10 tables
>>
>
> I noticed this while I was working on providing default dashboards for
> Cassandra-Datadog integration. I was told by Datadog team it would not be
> an issue for users, that I should not care about it. As you pointed out,
> per table metrics quickly increase the total number of metrics we need to
> collect.
>
> I believe you can set the following option: *"max_returned_metrics: 1000"* -
> it can be used if metrics are missing to increase the limit of the number
> of collected metrics. Be aware of CPU utilization that this might imply
> (greatly improved in dd-agent version 6+ I believe -thanks Datadog teams
> for that- making this fully usable for Cassandra). This option should go in
> the *cassandra.yaml* file for Cassandra integrations, off the top of my
> head.
>
> Also, do not hesitate to reach to Datadog directly for this kind of
> questions, I have always been very happy with their support so far, I am
> sure they would guide you through this as well, probably better than we can
> do :). It also provides them with feedback on what people are struggling
> with I imagine.
>
> I am interested to know if you still have issues getting more metrics
> (option above not working / CPU under too much load) as this would make the
> dashboards we built mostly unusable for clusters with more tables. We might
> then need to review the design.
>
> As a side note, I believe metrics are handled the same way cross version,
> they got the same name/label for C*2.1, 2.2 and 3+ on Datadog. There is an
> abstraction layer that removes this complexity (if I remember well, we
> built those dashboards a while ago).
>
> C*heers
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le lun. 1 oct. 2018 à 19:38, Carl Mueller
>  a écrit :
>
>> That's great too, thank you.
>>
>> Datadog 350 metric limit is a PITA for tables once you get over 10
>> tables, but I guess we can use bean_regex to do specific targetted metrics
>> for the important tables anyway.
>>
>> On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ 
>> wrote:
>>
>>> Hello Carl,
>>>
>>> Here is a message I sent to my team a few months ago. I hope this will
>>> be helpful to you and more people around :). It might not be exhaustive and
>>> we were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but
>>> C*2.2 is similar to C*3.0 if I remember correctly in terms of metrics. Here
>>> it is for what it's worth:
>>>
>>> Quite a few things changed between metric reporter in C* 2.1 and C*3.0.
>>> - ColumnFamily --> Table
>>> - XXpercentile --> pXX
>>> - 1MinuteRate -->  m1_rate
>>> - metric name before KS and Table names and some other changes of this
>>> kind.
>>> - ^ aggregations / aliases indexes changed because of this (using
>>> graphite for example) ^
>>> - ‘.value’ is not appended in the metric name anymore for gauges,
>>> nothing instead.
>>>
>>> For example (graphite):
>>>
>>> From
>>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$tab

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-09-28 Thread Carl Mueller
VERY NICE! Thank you very much

On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
lyuben.todo...@instaclustr.com> wrote:

> Nothing as fancy as a matrix but a list of what JMX term can see.
> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>
> /lyubent
>
> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>  wrote:
>
>> It's my understanding that metrics got heavily re-namespaced in JMX for
>> 2.2 from 2.1
>>
>> Did anyone ever make a migration matrix/guide for conversion of old
>> metrics to new metrics?
>>
>>
>>


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-01 Thread Carl Mueller
That's great too, thank you.

Datadog's 350-metric limit is a PITA for tables once you get over 10 tables,
but I guess we can use bean_regex to target specific metrics for the
important tables anyway.
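
Roughly the kind of filter we have in mind, with the layout as I remember the
Datadog JMX integration format and illustrative keyspace/table names, so
double-check against your agent version (on 2.1 the bean type is
ColumnFamily rather than Table, per the rename discussed in this thread):

init_config:
  is_jmx: true
  conf:
    - include:
        # only pull per-table metrics for the tables we actually care about
        bean_regex: "org.apache.cassandra.metrics:type=Table,keyspace=my_ks,scope=(table_a|table_b),name=.*"

instances:
  - host: localhost
    port: 7199
    # raise the agent-side cap if metrics go missing
    max_returned_metrics: 1000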

On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ  wrote:

> Hello Carl,
>
> Here is a message I sent to my team a few months ago. I hope this will be
> helpful to you and more people around :). It might not be exhaustive and we
> were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but C*2.2
> is similar to C*3.0 if I remember correctly in terms of metrics. Here it is
> for what it's worth:
>
> Quite a few things changed between metric reporter in C* 2.1 and C*3.0.
> - ColumnFamily --> Table
> - XXpercentile --> pXX
> - 1MinuteRate -->  m1_rate
> - metric name before KS and Table names and some other changes of this
> kind.
> - ^ aggregations / aliases indexes changed because of this (using graphite
> for example) ^
> - ‘.value’ is not appended in the metric name anymore for gauges, nothing
> instead.
>
> For example (graphite):
>
> From
> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
> 2, 3), 1, 7, 8, 9)
>
> to
> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
> 2, 3), 1, 8, 9, 10)
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le ven. 28 sept. 2018 à 20:38, Carl Mueller
>  a écrit :
>
>> VERY NICE! Thank you very much
>>
>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
>> lyuben.todo...@instaclustr.com> wrote:
>>
>>> Nothing as fancy as a matrix but a list of what JMX term can see.
>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>
>>> /lyubent
>>>
>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>>  wrote:
>>>
>>>> It's my understanding that metrics got heavily re-namespaced in JMX for
>>>> 2.2 from 2.1
>>>>
>>>> Did anyone ever make a migration matrix/guide for conversion of old
>>>> metrics to new metrics?
>>>>
>>>>
>>>>


Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-09-28 Thread Carl Mueller
It's my understanding that metrics got heavily re-namespaced in JMX between
2.1 and 2.2.

Did anyone ever make a migration matrix/guide for conversion of old metrics
to new metrics?


Re: Released an ACID-compliant transaction library on top of Cassandra

2019-01-16 Thread Carl Mueller
"2) Overview: In essence, the protocol calls for each data item to maintain
the last committed and perhaps also the currently active version, for the
data and relevant metadata. Each version is tagged with meta-data
pertaining to the transaction that created it. This includes the
transaction commit time and transaction identifier that created it,
pointing to a globally visible transaction status record (TSR) using a
Universal Resource Identifier (URI). The TSR is used by the client to
determine which version of the data item to use when reading it, and so
that transaction commit can happen just by updating (in one step) the TSR.
The transaction identifier, stored in the form of a URI, allows any client
regardless of its location to inspect the TSR in order to determine the
transaction commitment state. Using the status of the TSR, any failure can
be either rolled forward to the later version, or rolled back to the
previous version. The test-and-set capability on each item is used to
determine a consistent winner when multiple transactions attempt concurrent
activity on a conflicting set of items. A global order is put on the
records, through a consistent hash of the record identifiers, and used when
updating in order to prevent deadlocks. This approach is optimized to
permit parallel processing of the commit activity."

It seems to be a sort of log-structured/append/change-tracking storage where
multiple versions of the data being updated are tracked as transactions are
applied to them, and can therefore be rolled back.

Probably all active versions are read and then reduced to the final product
once all transactions are accounted for.

Of course you can't have perpetual transaction changes stored so they must
be ... compacted ... at some point?

... which is basically what Cassandra does at the node level in the
read/write path with LSM, bloom filters, and merging data across disparate
sstables...?

The devil is in the details of these things of course. Is that about right?
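
For concreteness, the "test-and-set capability on each item" maps naturally
onto a Cassandra lightweight transaction; a minimal sketch with made-up table
and column names rather than the library's actual layout:

-- claim/commit a new version of an item, conditional on the expected prior state
UPDATE items
SET value = 'v2', tx_id = 'urn:tx:124'
WHERE item_id = 'abc'
IF tx_id = 'urn:tx:123';
-- [applied] is true for exactly one of any competing transactions;
-- the losers get back the current tx_id and can consult that transaction's
-- status record to decide whether to roll forward or back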

On Tue, Nov 13, 2018 at 9:54 AM Ariel Weisberg  wrote:

> Hi,
>
> Fantastic news!
>
> Ariel
>
> On Tue, Nov 13, 2018, at 10:36 AM, Hiroyuki Yamada wrote:
> > Hi all,
> >
> > I am happy to release it under Apache 2 license now.
> > https://github.com/scalar-labs/scalardb
> >
> > It passes not only jepsen but also our-built destructive testing.
> > For jepsen tests, please check the following.
> > https://github.com/scalar-labs/scalardb/tree/master/jepsen/scalardb
> >
> > Also, as Yuji mentioned the other day, we also fixed/updated jepsen
> > tests for C* to make it work with the latest C* version properly and
> > follow the new style.
> > https://github.com/scalar-labs/jepsen/tree/cassandra
> >
> > In addition to that, we fixed/updated cassaforte used in the jepsen
> > tests for C* to make it work with the latest java driver since
> > cassaforte is not really maintained any more.
> > https://github.com/scalar-labs/cassaforte/tree/driver-3.0-for-jepsen
> >
> > We are pleased to be able to contribute to the community by the above
> updates.
> > Please give us any feedbacks or questions.
> >
> > Thanks,
> > Hiro
> >
> >
> > On Wed, Oct 17, 2018 at 8:52 AM Hiroyuki Yamada 
> wrote:
> > >
> > > Hi all,
> > >
> > > Thank you for the comments and feedbacks.
> > >
> > > As Jonathan pointed out, it relies on LWT and uses the protocol
> > > proposed in the paper.
> > > Please read the design document for more detail.
> > > https://github.com/scalar-labs/scalardb/blob/master/docs/design.md
> > >
> > > Regarding the licensing, we are thinking of releasing it with Apache 2
> > > if lots of developers are interested in it.
> > >
> > > Best regards,
> > > Hiroyuki
> > > On Wed, Oct 17, 2018 at 3:13 AM Jonathan Ellis 
> wrote:
> > > >
> > > > Which was followed up by
> https://www.researchgate.net/profile/Akon_Dey/publication/282156834_Scalable_Distributed_Transactions_across_Heterogeneous_Stores/links/56058b9608ae5e8e3f32b98d.pdf
> > > >
> > > > On Tue, Oct 16, 2018 at 1:02 PM Jonathan Ellis 
> wrote:
> > > >>
> > > >> It looks like it's based on this:
> http://www.vldb.org/pvldb/vol6/p1434-dey.pdf
> > > >>
> > > >> On Tue, Oct 16, 2018 at 11:37 AM Ariel Weisberg 
> wrote:
> > > >>>
> > > >>> Hi,
> > > >>>
> > > >>> Yes this does sound great. Does this rely on Cassandra's internal
> SERIAL consistency and CAS functionality or is that implemented at a higher
> level?
> > > >>>
> > > >>> Regards,
> > > >>> Ariel
> > > >>>
> > > >>> On Tue, Oct 16, 2018, at 12:31 PM, Jeff Jirsa wrote:
> > > >>> > This is great!
> > > >>> >
> > > >>> > --
> > > >>> > Jeff Jirsa
> > > >>> >
> > > >>> >
> > > >>> > > On Oct 16, 2018, at 5:47 PM, Hiroyuki Yamada <
> mogwa...@gmail.com> wrote:
> > > >>> > >
> > > >>> > > Hi all,
> > > >>> > >
> > > >>> > > # Sorry, I accidentally emailed the following to dev@, so
> re-sending to here.
> > > >>> > >
> > > >>> > > We have been working on ACID-compliant transaction library on
> top of
> > > >>> > > Cassandra called Scalar DB,

Re: Best practices while designing backup storage system for big Cassandra cluster

2019-04-02 Thread Carl Mueller
Another approach to avoiding the full-backup I/O hit would be to rotate which
node or small subset of nodes does a full backup on a given day, so that over
the course of a month or two every node gets a full backup. Of course this
assumes you have incremental capability for the other backup days/dates.

On Mon, Apr 1, 2019 at 1:30 PM Carl Mueller 
wrote:

> At my current job I had to roll my own backup system. Hopefully I can get
> it OSS'd at some point. Here is a (now slightly outdated) presentation:
>
>
> https://docs.google.com/presentation/d/13Aps-IlQPYAa_V34ocR0E8Q4C8W2YZ6Jn5_BYGrjqFk/edit#slide=id.p
>
> If you are struggling with the disk I/O cost of the sstable
> backups/copies, note that since sstables are append-only, if you adopt an
> incremental approach to your backups, you only need to track a list of the
> current files and upload the files that are new compared to a previous
> successful backup. Your "manifest" of files for a node will need to have
> references to the previous backup, and you'll wnat to "reset" with a full
> backup each month.
>
> I stole that idea from https://github.com/tbarbugli/cassandra_snapshotter.
> I would have used that but we had more complex node access modes
> (kubernetes, ssh through jumphosts, etc) and lots of other features needed
> that weren't supported.
>
> In AWS I use aws profiles to throttle the transfers, and parallelize
> across nodes. The basic unit of a successful backup is a single node, but
> you'll obviously want to track overall node success.
>
> Note that in rack-based topologies you really only need one whole
> successful rack if your RF is > # racks, and one DC.
>
> Beware doing simultaneous flushes/snapshots across the cluster at once,
> that might be the equivalent of a DDos. You might want to do a "jittered"
> randomized preflush of the cluster first before doing the snapshotting.
>
> Unfortunately, the nature of a distributed system is that snapshotting all
> the nodes at the precise same time is a hard problem.
>
> I also do not / have not used the built-in incremental backup feature of
> cassandra, which can enable more precise point-in-time backups (aside from
> the unflushed data in the commitlogs)
>
> A note on incrementals with occaisional FULLs: Note that FULL backups
> monthly might take more than a day or two, especially throttled. My
> incrementals were originally looking up previous manifests using only 'most
> recent", but then the long-running FULL backups were excluded from the
> "chain" of incremental backups. So I now implement a fuzzy lookup for the
> incrementals that prioritizes any FULL in the last 5 days over any more
> recent incremental. Thus you can purge old backups you don't need more
> safely using the monthly full backups as a reset point.
>
> On Mon, Apr 1, 2019 at 1:08 PM Alain RODRIGUEZ  wrote:
>
>> Hello Manish,
>>
>> I think any disk works. As long as it is big enough. It's also better if
>> it's a reliable system (some kind of redundant raid, NAS, storage like GCS
>> or S3...). We are not looking for speed mostly during a backup, but
>> resiliency and not harming the source cluster mostly I would say.
>> Then how fast you write to the backup storage system will probably be
>> more often limited by what you can read from the source cluster.
>> The backups have to be taken from running nodes, thus it's easy to
>> overload the disk (reads), network (export backup data to final
>> destination), and even CPU (as/if the machine handles the transfer).
>>
>> What are the best practices while designing backup storage system for big
>>> Cassandra cluster?
>>
>>
>> What is nice to have (not to say mandatory) is a system of incremental
>> backups. You should not take the data from the nodes every time, or you'll
>> either harm the cluster regularly OR spend days to transfer the data (if
>> the amount of data grows big enough).
>> I'm not speaking about Cassandra incremental snapshots, but of using
>> something like AWS Snapshot, or copying this behaviour programmatically to
>> take (copy, link?) old SSTables from previous backups when they exist, will
>> greatly unload the clusters work and the resource needed as soon enough a
>> substantial amount of the data should be coming from the backup data source
>> itself. The problem with incremental snapshot is that when restoring, you
>> have to restore multiple pieces, making it harder and involving a lot of
>> compaction work.
>> The "caching" technic mentioned above gives the best of the 2 worlds:
>> - You will always backup from the nodes only the sstables you don’t have
>> already in your backup stor

Re: Merging two cluster's in to one without any downtime

2019-03-25 Thread Carl Mueller
Either:

double-write at the driver level from one of the apps (sketched below) and
perform an initial and a subsequent sstable load (or whatever ETL method you
want to use) to merge the data with good assurances.

use a trigger to replicate the writes, with some sstable loads / ETL.

use change data capture with some sstable loads/ETL
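
A minimal sketch of the double-write option, assuming java-driver 3.x and
illustrative keyspace/table names:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class DualWriter {
    private final Session oldSession;
    private final Session newSession;
    private final PreparedStatement oldInsert;
    private final PreparedStatement newInsert;

    public DualWriter(Cluster oldCluster, Cluster newCluster) {
        this.oldSession = oldCluster.connect("my_keyspace");
        this.newSession = newCluster.connect("my_keyspace");
        String cql = "INSERT INTO my_table (id, value) VALUES (?, ?)";
        this.oldInsert = oldSession.prepare(cql);
        this.newInsert = newSession.prepare(cql);
    }

    public void write(String id, String value) {
        oldSession.execute(oldInsert.bind(id, value));       // source of truth during the migration
        newSession.executeAsync(newInsert.bind(id, value));  // best-effort shadow write to the new cluster
    }
}

The old cluster stays the source of truth; the shadow write is asynchronous,
and the sstable load afterwards backfills whatever the shadow writes missed.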

On Mon, Mar 25, 2019 at 5:48 PM Nick Hatfield 
wrote:

> Maybe others will have a different or better solution but, in my
> experience to accomplish HA we simply y write from our application to the
> new cluster. You then export the data from the old cluster using cql2json
> or any method you choose, to the new cluster. That will cover all live(now)
> data via y write, while supplying the old data from the copy you run. Once
> complete, set up a single reader that reads data from the new cluster and
> verify all is as expected!
>
>
> Sent from my BlackBerry 10 smartphone on the Verizon Wireless 4G LTE network.
> *From: *Nandakishore Tokala
> *Sent: *Monday, March 25, 2019 18:39
> *To: *user@cassandra.apache.org
> *Reply To: *user@cassandra.apache.org
> *Subject: *Merging two cluster's in to one without any downtime
>
> Please let me know the best practices to combine 2 different cluster's
> into one without having any downtime.
>
> Thanks & Regards,
> Nanda Kishore
>


upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-25 Thread Carl Mueller
This is a multi-dc cluster with public IPs for the nodes and also addressed
with private IPs as well in AWS. The apps connect via java-driver to a
public IP.

When we built the 2.1.X cluster with ec2multiregionsnitch, the system.peers
table had public ips for the nodes in the rpc_address column.

After the upgrade from 2.1.x to 2.2.x, the java-driver and/or cassandra
appears to be resolving to internal private IPs when building the cluster
map on the client. The system.peers table now had internal/private IPs in
the rpc_address column.

Since the internal IPs are given when the client app connects to the
cluster, the client app cannot communicate with other nodes in other
datacenters. They seem to be able to communicate within its own datacenter
of the initial connection.

It appears we fixed this by manually updating the system.peers table's
rpc_address column back to the public IP. This appears to survive a restart
of the cassandra nodes without being switched back to private IPs.

The gossipinfo after the 2.1 --> 2.2 upgrade reports internal IPs for both
RPC_ADDRESS and INTERNAL_IP, while in the 2.1.x version of gossipinfo it
reported RPC_ADDRESS as the public IP and INTERNAL_IP as the internal IP.

Rolling restarts did not solve this either, only our manual updates to
system.peers.

Our cassandra.yaml (these parameters are the same in our confs for 2.1 and
2.2) has:

listen_address: internal aws vpc ip
rpc_address: 0.0.0.0
broadcast_rpc_address: internal aws vpc ip

Are there changes to ec2multiregionsnitch or the java-driver binary
protocol that require additional configuration? Did the resolution of
addresses based on cassandra.yaml parameters change?

So for more reference, here is the system.peers table in initial state when
the cluster is 2.1.x, with the rpc_address showing public ips.

 peer        | data_center | preferred_ip | rack | release_version | rpc_address | schema_version
-------------+-------------+--------------+------+-----------------+-------------+--------------------------------------
 public_ip_1 | us-east     | private_ip_1 | 1c   | 2.1.9           | public_ip_1 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_2 | us-east     | private_ip_2 | 1e   | 2.1.9           | public_ip_2 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_3 | us-east     | private_ip_3 | 1d   | 2.1.9           | public_ip_3 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_4 | eu-west     | null         | 1a   | 2.1.9           | public_ip_4 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_5 | eu-west     | null         | 1b   | 2.1.9           | public_ip_5 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_6 | us-east     | private_ip_6 | 1e   | 2.1.9           | public_ip_6 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_7 | eu-west     | null         | 1b   | 2.1.9           | public_ip_7 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_8 | eu-west     | null         | 1c   | 2.1.9           | public_ip_8 | 65398421-84e8-307f-ae52-f6da42ff70c3
 public_ip_9 | eu-west     | null         | 1a   | 2.1.9           | public_ip_9 | 65398421-84e8-307f-ae52-f6da42ff70c3


THEN after we upgraded to 2.2.X, note the change of the addresses in
rpc_address to the private ones:

 peer        | data_center | preferred_ip | rack | release_version | rpc_address  | schema_version
-------------+-------------+--------------+------+-----------------+--------------+--------------------------------------
 public_ip_1 | us-east     | private_ip_1 | 1c   | 2.2.13          | private_ip_1 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_2 | us-east     | private_ip_2 | 1e   | 2.2.13          | private_ip_2 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_3 | us-east     | private_ip_3 | 1d   | 2.2.13          | private_ip_3 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_4 | eu-west     | null         | 1a   | 2.2.13          | private_ip_4 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_5 | eu-west     | null         | 1b   | 2.2.13          | private_ip_5 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_6 | us-east     | private_ip_6 | 1e   | 2.2.13          | private_ip_6 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_7 | eu-west     | null         | 1b   | 2.2.13          | private_ip_7 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_8 | eu-west     | null         | 1c   | 2.2.13          | private_ip_8 | 89b260c9-70c1-3119-b37a-30b464851c9f
 public_ip_9 | eu-west     | null         | 1a   | 2.2.13          | private_ip_9 | 89b260c9-70c1-3119-b37a-30b464851c9f

So we had to manually update the system.peers rpc_address back to the
public ips to get the java-driver to work again.
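
(For reference, the driver-side address translation suggested later in this
thread looks roughly like the sketch below, assuming java-driver 3.x. It
presumes the nodes are broadcasting their public IPs, as in the 2.1
behaviour, so it complements rather than replaces the server-side fix above.)

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.policies.EC2MultiRegionAddressTranslator;

public class ClusterFactory {
    public static Cluster build(String contactPoint) {
        // translates node broadcast addresses to private IPs when the client
        // runs in the same region, and leaves them public otherwise
        return Cluster.builder()
                .addContactPoint(contactPoint)
                .withAddressTranslator(new EC2MultiRegionAddressTranslator())
                .build();
    }
}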


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-27 Thread Carl Mueller
We are probably going to just have a VM startup script for now that
automatically updates the yaml on instance restart. It seems to be the
least-sucky approach at this point.
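
A sketch of what that hook could look like (the metadata URL is the standard
EC2 endpoint; the yaml path is an assumption):

#!/bin/bash
# refresh broadcast_rpc_address with the instance's current public IP before Cassandra starts
PUBLIC_IP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)
sed -i "s/^broadcast_rpc_address:.*/broadcast_rpc_address: ${PUBLIC_IP}/" /etc/cassandra/cassandra.yaml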

On Wed, Mar 27, 2019 at 12:36 PM Carl Mueller 
wrote:

> I filed https://issues.apache.org/jira/browse/CASSANDRA-15068
>
> EIPs per the aws experts cost money, are limited in resources (we have a
> lot of VMs) and cause a lot of headaches in our autoscaling /
> infrastructure as code systems.
>
> On Wed, Mar 27, 2019 at 12:35 PM Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> I'll try to get a replicated error message, but it was along the lines of
>> what is in the gossip strategy agnostic description in cassandra.yaml
>> comments of what happens when you set rpc_address to 0.0.0.0: you must
>> then set broadcast_rpc_address.
>>
>> On Wed, Mar 27, 2019 at 3:21 AM Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> On Tue, Mar 26, 2019 at 10:28 PM Carl Mueller
>>>  wrote:
>>>
>>>> - the AWS people say EIPs are a PITA.
>>>>
>>>
>>> Why?
>>>
>>>
>>>> - if we hardcode the global IPs in the yaml, then yaml editing is
>>>> required for the occaisional hard instance reboot in aws and its attendant
>>>> global ip reassignment
>>>> - if we try leaving broadcast_rpc_address blank, null , or commented
>>>> out with rpc_address set to 0.0.0.0 then cassandra refuses to start
>>>>
>>>
>>> Yeah, that's not nice.
>>>
>>> - if we take out rpc_address and broadcast_rpc_address, then cqlsh
>>>> doesn't work with localhost anymore and that fucks up some of our cluster
>>>> managemetn tooling
>>>>
>>>> - we kind of are being lazy and just want what worked in 2.1 to work in
>>>> 2.2
>>>>
>>>
>>> Makes total sense to me.
>>>
>>> I'll try to track down where cassandra startup is complaining to us
>>>> about rpc_address: 0.0.0.0 and broadcast_rpc_address being
>>>> blank/null/commented out. That section of code may need an exception for
>>>> EC2MRS.
>>>>
>>>
>>> It sounds like this check is done before instantiating the snitch and it
>>> should be other way round, so that the snitch can have a chance to adjust
>>> the configuration before it's checked for correctness.  Do you have the
>>> exact error message with which it complains?
>>>
>>> --
>>> Alex
>>>
>>>


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-27 Thread Carl Mueller
I filed https://issues.apache.org/jira/browse/CASSANDRA-15068

EIPs per the aws experts cost money, are limited in resources (we have a
lot of VMs) and cause a lot of headaches in our autoscaling /
infrastructure as code systems.

On Wed, Mar 27, 2019 at 12:35 PM Carl Mueller 
wrote:

> I'll try to get a replicated error message, but it was along the lines of
> what is in the gossip strategy agnostic description in cassandra.yaml
> comments of what happens when you set rpc_address to 0.0.0.0: you must
> then set broadcast_rpc_address.
>
> On Wed, Mar 27, 2019 at 3:21 AM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Tue, Mar 26, 2019 at 10:28 PM Carl Mueller
>>  wrote:
>>
>>> - the AWS people say EIPs are a PITA.
>>>
>>
>> Why?
>>
>>
>>> - if we hardcode the global IPs in the yaml, then yaml editing is
>>> required for the occaisional hard instance reboot in aws and its attendant
>>> global ip reassignment
>>> - if we try leaving broadcast_rpc_address blank, null , or commented out
>>> with rpc_address set to 0.0.0.0 then cassandra refuses to start
>>>
>>
>> Yeah, that's not nice.
>>
>> - if we take out rpc_address and broadcast_rpc_address, then cqlsh
>>> doesn't work with localhost anymore and that fucks up some of our cluster
>>> managemetn tooling
>>>
>>> - we kind of are being lazy and just want what worked in 2.1 to work in
>>> 2.2
>>>
>>
>> Makes total sense to me.
>>
>> I'll try to track down where cassandra startup is complaining to us about
>>> rpc_address: 0.0.0.0 and broadcast_rpc_address being blank/null/commented
>>> out. That section of code may need an exception for EC2MRS.
>>>
>>
>> It sounds like this check is done before instantiating the snitch and it
>> should be other way round, so that the snitch can have a chance to adjust
>> the configuration before it's checked for correctness.  Do you have the
>> exact error message with which it complains?
>>
>> --
>> Alex
>>
>>


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-27 Thread Carl Mueller
I'll try to reproduce the exact error message, but it was along the lines of
the gossip-strategy-agnostic description in the cassandra.yaml comments of
what happens when you set rpc_address to 0.0.0.0: you must then set
broadcast_rpc_address.

On Wed, Mar 27, 2019 at 3:21 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Mar 26, 2019 at 10:28 PM Carl Mueller
>  wrote:
>
>> - the AWS people say EIPs are a PITA.
>>
>
> Why?
>
>
>> - if we hardcode the global IPs in the yaml, then yaml editing is
>> required for the occaisional hard instance reboot in aws and its attendant
>> global ip reassignment
>> - if we try leaving broadcast_rpc_address blank, null , or commented out
>> with rpc_address set to 0.0.0.0 then cassandra refuses to start
>>
>
> Yeah, that's not nice.
>
> - if we take out rpc_address and broadcast_rpc_address, then cqlsh doesn't
>> work with localhost anymore and that fucks up some of our cluster
>> managemetn tooling
>>
>> - we kind of are being lazy and just want what worked in 2.1 to work in
>> 2.2
>>
>
> Makes total sense to me.
>
> I'll try to track down where cassandra startup is complaining to us about
>> rpc_address: 0.0.0.0 and broadcast_rpc_address being blank/null/commented
>> out. That section of code may need an exception for EC2MRS.
>>
>
> It sounds like this check is done before instantiating the snitch and it
> should be other way round, so that the snitch can have a chance to adjust
> the configuration before it's checked for correctness.  Do you have the
> exact error message with which it complains?
>
> --
> Alex
>
>


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-26 Thread Carl Mueller
Looking at the code it appears it shouldn't matter what we set the yaml
params to. The Ec2MultiRegionSnitch should be using the aws metadata
169.254.169.254 to pick up the internal/external ips as needed.

I think I'll just have to dig into the code differences between 2.1 and
2.2. We don't want to specify the global IP in any of the yaml fields
because the global IP for the instance changes if we do an AWS instance
restart. We don't want yaml editing to be a part of the instance restart
process.

And I was misinformed, an instance restart in our 2.2 cluster does
overwrite the manual system.peers entries, which I expected to happen.

On Tue, Mar 26, 2019 at 3:33 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Mar 25, 2019 at 11:13 PM Carl Mueller
>  wrote:
>
>>
>> Since the internal IPs are given when the client app connects to the
>> cluster, the client app cannot communicate with other nodes in other
>> datacenters.
>>
>
> Why should it?  The client should only connect to its local data center
> and leave communication with remote DCs to the query coordinator.
>
>
>> They seem to be able to communicate within its own datacenter of the
>> initial connection.
>>
>
> Did you configure address translation on the client?  See:
> https://docs.datastax.com/en/developer/java-driver/3.0/manual/address_resolution/#ec2-multi-region
>
> It appears we fixed this by manually updating the system.peers table's
>> rpc_address column back to the public IP. This appears to survive a restart
>> of the cassandra nodes without being switched back to private IPs.
>>
>
> I don't think updating system tables is a supported solution.  I'm
> surprised that even doesn't give you an error.
>
> Our cassandra.yaml (these parameters are the same in our confs for 2.1 and
>> 2.2) has:
>>
>> listen_address: internal aws vpc ip
>> rpc_address: 0.0.0.0
>> broadcast_rpc_address: internal aws vpc ip
>>
>
> It is not straightforward to find the docs for version 2.x anymore, but at
> least for 3.0 it is documented that you should set broadcast_rpc_address to
> the public IP:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archSnitchEC2MultiRegion.html
>
> Regards,
> --
> Alex
>
>


Re: upgrading 2.1.x cluster with ec2multiregionsnitch system.peers "corruption"

2019-03-26 Thread Carl Mueller
- the AWS people say EIPs are a PITA.
- if we hardcode the global IPs in the yaml, then yaml editing is required
for the occasional hard instance reboot in AWS and its attendant global IP
reassignment
- if we try leaving broadcast_rpc_address blank, null, or commented out
with rpc_address set to 0.0.0.0, then cassandra refuses to start
- if we take out rpc_address and broadcast_rpc_address, then cqlsh doesn't
work with localhost anymore and that fucks up some of our cluster
management tooling

- we kind of are being lazy and just want what worked in 2.1 to work in 2.2

Ok, the code in 2.1:
https://github.com/apache/cassandra/blob/cassandra-2.1/src/java/org/apache/cassandra/locator/Ec2MultiRegionSnitch.java

Of interest

DatabaseDescriptor.setBroadcastAddress(localPublicAddress);
DatabaseDescriptor.setBroadcastRpcAddress(localPublicAddress);

The code in 2.2+:
https://github.com/apache/cassandra/blob/cassandra-2.2/src/java/org/apache/cassandra/locator/Ec2MultiRegionSnitch.java

Becomes

DatabaseDescriptor.setBroadcastAddress(localPublicAddress);
if (DatabaseDescriptor.getBroadcastRpcAddress() == null)
{
    logger.info("broadcast_rpc_address unset, broadcasting public IP as rpc_address: {}", localPublicAddress);
    DatabaseDescriptor.setBroadcastRpcAddress(localPublicAddress);
}
And that if clause, added as part of the CASSANDRA-11356 patch, is what is
submarining us. I don't otherwise know the intricacies of the various
address settings in the yaml vis-a-vis EC2MRS, but since we can't configure
it the good old 2.1 way in 2.2+, this seems broken to us.

I'll try to track down where cassandra startup is complaining to us about
rpc_address: 0.0.0.0 and broadcast_rpc_address being blank/null/commented
out. That section of code may need an exception for EC2MRS.



On Tue, Mar 26, 2019 at 12:01 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Tue, Mar 26, 2019 at 5:49 PM Carl Mueller
>  wrote:
>
>> Looking at the code it appears it shouldn't matter what we set the yaml
>> params to. The Ec2MultiRegionSnitch should be using the aws metadata
>> 169.254.169.254 to pick up the internal/external ips as needed.
>>
>
> This is somehow my expectation as well, so maybe the docs are just
> outdated.
>
> I think I'll just have to dig in to the code differences between 2.1 and
>> 2.2. We don't want to specify the glboal IP in any of the yaml fields
>> because the global IP for the instance changes if we do an aws instance
>> restart. Don't want yaml editing to be a part of the instance restart
>> process.
>>
>
> We did solve this in the past by using Elastic IPs: anything prevents you
> from using those?
>
> --
> Alex
>
>


Re: Best practices while designing backup storage system for big Cassandra cluster

2019-04-01 Thread Carl Mueller
At my current job I had to roll my own backup system. Hopefully I can get
it OSS'd at some point. Here is a (now slightly outdated) presentation:

https://docs.google.com/presentation/d/13Aps-IlQPYAa_V34ocR0E8Q4C8W2YZ6Jn5_BYGrjqFk/edit#slide=id.p

If you are struggling with the disk I/O cost of the sstable backups/copies,
note that since sstables are append-only, if you adopt an incremental
approach to your backups, you only need to track a list of the current
files and upload the files that are new compared to a previous successful
backup. Your "manifest" of files for a node will need to have references to
the previous backup, and you'll want to "reset" with a full backup each
month.
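
A rough sketch of the per-node incremental step, with made-up paths and
bucket names, and no snapshotting, error handling, or manifest chaining:

#!/bin/bash
# upload only the sstable files that were not in the previous successful backup;
# previous_manifest.txt (from that backup) is assumed to be sorted the same way
DATA_DIR=/var/lib/cassandra/data/my_keyspace
BUCKET=s3://my-backup-bucket/$(hostname)

find "$DATA_DIR" -type f -printf '%P\n' | sort > current_manifest.txt
# lines present in the current manifest but not the previous one
comm -23 current_manifest.txt previous_manifest.txt > new_files.txt

while read -r f; do
  aws s3 cp "$DATA_DIR/$f" "$BUCKET/$f"
done < new_files.txt

aws s3 cp current_manifest.txt "$BUCKET/manifests/$(date +%Y%m%d).txt"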

I stole that idea from https://github.com/tbarbugli/cassandra_snapshotter.
I would have used that but we had more complex node access modes
(kubernetes, ssh through jumphosts, etc) and lots of other features needed
that weren't supported.

In AWS I use aws profiles to throttle the transfers, and parallelize across
nodes. The basic unit of a successful backup is a single node, but you'll
obviously want to track overall node success.
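
If the throttling is done through AWS CLI profiles, the knobs look roughly
like this (profile name and numbers are illustrative):

# cap bandwidth and concurrency for the backup profile, then transfer with it
aws configure set s3.max_bandwidth 50MB/s --profile cassandra-backup
aws configure set s3.max_concurrent_requests 2 --profile cassandra-backup
aws s3 cp /var/lib/cassandra/data/my_keyspace/somefile.db s3://my-backup-bucket/ --profile cassandra-backup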

Note that in rack-based topologies you really only need one whole
successful rack if your RF is > # racks, and one DC.

Beware doing simultaneous flushes/snapshots across the cluster at once,
that might be the equivalent of a DDos. You might want to do a "jittered"
randomized preflush of the cluster first before doing the snapshotting.
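
A sketch of the jittered preflush, assuming a plain hosts file and up to five
minutes of jitter per node:

#!/bin/bash
# pre-flush every node with random jitter so the flushes don't land at once
while read -r host; do
  ssh "$host" 'sleep $((RANDOM % 300)); nodetool flush' &
done < cluster_hosts.txt
wait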

Unfortunately, the nature of a distributed system is that snapshotting all
the nodes at the precise same time is a hard problem.

I also do not / have not used the built-in incremental backup feature of
cassandra, which can enable more precise point-in-time backups (aside from
the unflushed data in the commitlogs)

A note on incrementals with occasional FULLs: monthly FULL backups
might take more than a day or two, especially throttled. My
incrementals were originally looking up previous manifests using only "most
recent", but then the long-running FULL backups were excluded from the
"chain" of incremental backups. So I now implement a fuzzy lookup for the
incrementals that prioritizes any FULL in the last 5 days over any more
recent incremental. Thus you can purge old backups you don't need more
safely using the monthly full backups as a reset point.

On Mon, Apr 1, 2019 at 1:08 PM Alain RODRIGUEZ  wrote:

> Hello Manish,
>
> I think any disk works. As long as it is big enough. It's also better if
> it's a reliable system (some kind of redundant raid, NAS, storage like GCS
> or S3...). We are not looking for speed mostly during a backup, but
> resiliency and not harming the source cluster mostly I would say.
> Then how fast you write to the backup storage system will probably be more
> often limited by what you can read from the source cluster.
> The backups have to be taken from running nodes, thus it's easy to
> overload the disk (reads), network (export backup data to final
> destination), and even CPU (as/if the machine handles the transfer).
>
> What are the best practices while designing backup storage system for big
>> Cassandra cluster?
>
>
> What is nice to have (not to say mandatory) is a system of incremental
> backups. You should not take the data from the nodes every time, or you'll
> either harm the cluster regularly OR spend days to transfer the data (if
> the amount of data grows big enough).
> I'm not speaking about Cassandra incremental snapshots, but of using
> something like AWS Snapshot, or copying this behaviour programmatically to
> take (copy, link?) old SSTables from previous backups when they exist, will
> greatly unload the clusters work and the resource needed as soon enough a
> substantial amount of the data should be coming from the backup data source
> itself. The problem with incremental snapshot is that when restoring, you
> have to restore multiple pieces, making it harder and involving a lot of
> compaction work.
> The "caching" technic mentioned above gives the best of the 2 worlds:
> - You will always backup from the nodes only the sstables you don’t have
> already in your backup storage system,
> - You will always restore easily as each backup is a full backup.
>
> It's not really a "hands-on" writing, but this should let you know about
> existing ways to do backups and the tradeoffs, I wrote this a year ago:
> http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
> .
>
> It's a complex topic, I hope some of this is helpful to you.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> Le jeu. 28 mars 2019 à 11:24, manish khandelwal <
> manishkhandelwa...@gmail.com> a écrit :
>
>> Hi
>>
>>
>>
>> I would like to know is there any guideline for selecting storage device
>> (disk type) for Cassandra backups.
>>
>>
>>
>> As per my current observation, NearLine 

cassandra upgrades multi-DC in parallel

2019-03-12 Thread Carl Mueller
If there are multiple DCs in a cluster, is it safe to upgrade them in
parallel, with each DC doing a node-at-a-time?


Re: How to install an older minor release?

2019-04-10 Thread Carl Mueller
You'll have to set up a local repo like Artifactory.
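
Or, as a stopgap, pull the specific build straight from the pool mentioned
below and pin it locally; a sketch (the package file name is an assumption
based on the pool layout):

wget http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/cassandra_3.0.17_all.deb
sudo dpkg -i cassandra_3.0.17_all.deb
sudo apt-mark hold cassandra   # keep apt from upgrading it underneath you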

On Wed, Apr 3, 2019 at 4:33 AM Kyrylo Lebediev 
wrote:

> Hi Oleksandr,
>
> Yes, that was always the case. All older versions are removed from Debian
> repo index :(
>
>
>
> *From: *Oleksandr Shulgin 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Tuesday, April 2, 2019 at 20:04
> *To: *User 
> *Subject: *How to install an older minor release?
>
>
>
> Hello,
>
>
>
> We've just noticed that we cannot install older minor releases of Apache
> Cassandra from Debian packages, as described on this page:
> http://cassandra.apache.org/download/
>
>
>
> Previously we were doing the following at the last step: apt-get install
> cassandra==3.0.17
>
>
>
> Today it fails with error:
>
> E: Version '3.0.17' for 'cassandra' was not found
>
>
>
> And `apt-get show cassandra` reports only one version available, the
> latest released one: 3.0.18
>
> The packages for the older versions are still in the pool:
> http://dl.bintray.com/apache/cassandra/pool/main/c/cassandra/
>
>
>
> Was it always the case that only the latest version is available to be
> installed directly with apt or did something change recently?
>
>
>
> Regards,
>
> --
>
> Alex
>
>
>


Re: cass-2.2 trigger - how to get clustering columns and value?

2019-04-11 Thread Carl Mueller
Thank you all.

On Thu, Apr 11, 2019 at 4:35 AM Paul Chandler  wrote:

> Hi Carl,
>
> I know this is not exactly answering your question, but it may help with
> the split.
>
> I have split a multi-tenancy cluster several times using a similar
> process to TLP’s Data Centre Switch:
> http://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>
> However instead of phase 3, we have split the cluster, by changing the
> seeds definition to only point at nodes within their own DC, and changing the
> cluster name of the new DC. This last step does require a short downtime of
> the cluster.
>
> We have had success with this method, and if you only want to track
> the updates to feed into the new cluster, then this will work; however, if
> you want it for anything else then it doesn't help at all.
>
> I can supply more details later if this method is of interest.
>
> Thanks
>
> Paul Chandler
>
> > On 10 Apr 2019, at 22:52, Carl Mueller 
> > 
> wrote:
> >
> > We have a multitenant cluster that we can't upgrade to 3.x easily, and
> we'd like to migrate some apps off of the shared cluster to dedicated
> clusters.
> >
> > This is a 2.2 cluster.
> >
> > So I'm trying a trigger to track updates while we transition and will
> send via kafka. Right now I'm just trying to extract all the data from the
> incoming updates
> >
> > so for
> >
> > public Collection augment(ByteBuffer key, ColumnFamily
> update) {
> >
> > the names returned by the update.getColumnNames() for an update of a
> table with two clustering columns and had a regular column update produced
> two CellName/Cells:
> >
> > one has no name, and no apparent raw value (bytebuffer is empty)
> >
> > the other is the data column.
> >
> > I can extract the primary key from the key field
> >
> > But how do I get the values of the two clustering columns? They aren't
> listed in the iterator, and they don't appear to be in the key field. Since
> clustering columns are encoded into the name of a cell, I'd imagine there
> might be some "unpacking" trick to that.
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


cass-2.2 trigger - how to get clustering columns and value?

2019-04-10 Thread Carl Mueller
We have a multitenant cluster that we can't upgrade to 3.x easily, and we'd
like to migrate some apps off of the shared cluster to dedicated clusters.

This is a 2.2 cluster.

So I'm trying a trigger to track updates while we transition and will send
via kafka. Right now I'm just trying to extract all the data from the
incoming updates

so for

public Collection augment(ByteBuffer key, ColumnFamily
update) {

the names returned by update.getColumnNames(), for an update of a table
with two clustering columns and a regular-column update, were two
CellName/Cells:

one has no name, and no apparent raw value (bytebuffer is empty)

the other is the data column.

I can extract the primary key from the key field

But how do I get the values of the two clustering columns? They aren't
listed in the iterator, and they don't appear to be in the key field. Since
clustering columns are encoded into the name of a cell, I'd imagine there
might be some "unpacking" trick to that.
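
For anyone else digging into this, here is a minimal sketch of the unpacking
that appears to be needed; the method names come from my reading of the 2.2
tree, so treat them as assumptions and verify against the exact source:

import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.cassandra.config.CFMetaData;
import org.apache.cassandra.config.ColumnDefinition;
import org.apache.cassandra.db.Cell;
import org.apache.cassandra.db.ColumnFamily;
import org.apache.cassandra.db.Mutation;
import org.apache.cassandra.db.composites.CellName;
import org.apache.cassandra.triggers.ITrigger;

public class ClusteringDumpTrigger implements ITrigger
{
    public Collection<Mutation> augment(ByteBuffer key, ColumnFamily update)
    {
        CFMetaData cfm = update.metadata();
        List<ColumnDefinition> clustering = cfm.clusteringColumns();

        for (Cell cell : update)
        {
            CellName name = cell.name();
            // clustering values are the leading components of the composite cell name
            for (int i = 0; i < name.clusteringSize(); i++)
            {
                ColumnDefinition def = clustering.get(i);
                System.out.println(def.name + " = " + def.type.getString(name.get(i)));
            }
        }
        return Collections.emptyList();
    }
}

The idea is that each CellName is a Composite whose leading components are
the clustering values, and cfm.clusteringColumns() supplies the types needed
to decode them.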


Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

2019-04-17 Thread Carl Mueller
No, we just did the package upgrade 2.1.9 --> 2.2.13

It definitely feels like some indexes are being recalculated or the entire
sstables are being scanned due to suspected corruption.


On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa  wrote:

> There was a time when changing some of the parameters (especially bloom
> filter FP ratio) would cause the bloom filters to be rebuilt on startup if
> the sstables didnt match what was in the schema, leading to a delay like
> that and similar logs. Any chance you changed the schema on that table
> since the last time you restarted it?
>
>
>
> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller
>  wrote:
>
>> Oh, the table in question is SizeTiered, had about 10 sstables total, it
>> was JBOD across two data directories.
>>
>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <
>> carl.muel...@smartthings.com> wrote:
>>
>>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably
>>> 20-30 clusters so far and have not encountered anything like this yet.
>>>
>>> After upgrade of a node, the restart takes a long time. like 10 minutes
>>> long. ALmost all of our other nodes took less than 2 minutes to upgrade
>>> (aside from sstableupgrades).
>>>
>>> The startup stalls on a particular table, it is the largest table at
>>> about 300GB, but we have upgraded other clusters with about that much data
>>> without this 8-10 minute delay. We have the ability to roll back the node,
>>> and the restart as a 2.1.x node is normal with no delays.
>>>
>>> Alas this is a prod cluster so we are going to try to sstable load the
>>> data on a lower environment and try to replicate the delay. If we can, we
>>> will turn on debug logging.
>>>
>>> This occurred on the first node we tried to upgrade. It is possible it
>>> is limited to only this node, but we are gunshy to play around with
>>> upgrades in prod.
>>>
>>> We have an automated upgrading program that flushes, snapshots, shuts
>>> down gossip, drains before upgrade, suppressed autostart on upgrade, and
>>> has worked about as flawlessly as one could hope for so far for 2.1->2.2
>>> and 2.2-> 3.11 upgrades.
>>>
>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
>>> Initializing .access_token
>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
>>> Initializing .refresh_token
>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
>>> Initializing .userid
>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
>>> Initializing .access_token_by_auth
>>>
>>> You can see the 6:30 delay in the startup log above. All the other
>>> keyspace/tables initialize in under a second.
>>>
>>>
>>>


Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

2019-04-17 Thread Carl Mueller
Will try if we get to replicate the node upgrade on that node, or if we
replicate in a lower env.

Thanks


On Wed, Apr 17, 2019 at 1:49 PM Jon Haddad  wrote:

> Let me be more specific - run the async java profiler and generate a
> flame graph to determine where CPU time is spent.
>
> On Wed, Apr 17, 2019 at 11:36 AM Jon Haddad  wrote:
> >
> > Run the async java profiler on the node to determine what it's doing:
> > https://github.com/jvm-profiling-tools/async-profiler
> >
> > On Wed, Apr 17, 2019 at 11:31 AM Carl Mueller
> >  wrote:
> > >
> > > No, we just did the package upgrade 2.1.9 --> 2.2.13
> > >
> > > It definitely feels like some indexes are being recalculated or the
> entire sstables are being scanned due to suspected corruption.
> > >
> > >
> > > On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa  wrote:
> > >>
> > >> There was a time when changing some of the parameters (especially
> bloom filter FP ratio) would cause the bloom filters to be rebuilt on
> startup if the sstables didnt match what was in the schema, leading to a
> delay like that and similar logs. Any chance you changed the schema on that
> table since the last time you restarted it?
> > >>
> > >>
> > >>
> > >> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller <
> carl.muel...@smartthings.com.invalid> wrote:
> > >>>
> > >>> Oh, the table in question is SizeTiered, had about 10 sstables
> total, it was JBOD across two data directories.
> > >>>
> > >>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller <
> carl.muel...@smartthings.com> wrote:
> > >>>>
> > >>>> We are doing a ton of upgrades to get out of 2.1.x. We've done
> probably 20-30 clusters so far and have not encountered anything like this
> yet.
> > >>>>
> > >>>> After upgrade of a node, the restart takes a long time. like 10
> minutes long. ALmost all of our other nodes took less than 2 minutes to
> upgrade (aside from sstableupgrades).
> > >>>>
> > >>>> The startup stalls on a particular table, it is the largest table
> at about 300GB, but we have upgraded other clusters with about that much
> data without this 8-10 minute delay. We have the ability to roll back the
> node, and the restart as a 2.1.x node is normal with no delays.
> > >>>>
> > >>>> Alas this is a prod cluster so we are going to try to sstable load
> the data on a lower environment and try to replicate the delay. If we can,
> we will turn on debug logging.
> > >>>>
> > >>>> This occurred on the first node we tried to upgrade. It is possible
> it is limited to only this node, but we are gunshy to play around with
> upgrades in prod.
> > >>>>
> > >>>> We have an automated upgrading program that flushes, snapshots,
> shuts down gossip, drains before upgrade, suppressed autostart on upgrade,
> and has worked about as flawlessly as one could hope for so far for
> 2.1->2.2 and 2.2-> 3.11 upgrades.
> > >>>>
> > >>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
> Initializing .access_token
> > >>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
> Initializing .refresh_token
> > >>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
> Initializing .userid
> > >>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
> Initializing .access_token_by_auth
> > >>>>
> > >>>> You can see the 6:30 delay in the startup log above. All the other
> keyspace/tables initialize in under a second.
> > >>>>
> > >>>>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


re: Trouble restoring with sstableloader

2019-04-18 Thread Carl Mueller
This is a response to a message from 2017 that I found unanswered on the
user list; we were getting the same error.
Also in this stackoverflow

https://stackoverflow.com/questions/53160611/frame-size-352518912-larger-than-max-length-15728640-exception-while-runnin/55751104#55751104

I have noted what we had to do to get things working. In the original report
it appears the -tf and/or various keystore/truststore params weren't
supplied; in our case we were missing the -tf parameter.

... then we ran into the PKIX error.
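
For reference, the shape of invocation involved is roughly the following;
hosts, paths and password are placeholders, and the exact flags should be
checked against `sstableloader --help` for your version (-ks/-kspw are also
needed if client certificate auth is on):

sstableloader -d 10.0.0.1,10.0.0.2 \
  -tf org.apache.cassandra.thrift.SSLTransportFactory \
  -ts /etc/cassandra/conf/truststore.jks -tspw 'truststore-password' \
  /path/to/backup/my_keyspace/my_table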

Original message:
---

Hi all,

I've been running into the following issue while trying to restore a C*
database via sstableloader:

Could not retrieve endpoint ranges:
org.apache.thrift.transport.TTransportException: Frame size (352518912)
larger than max length (15728640)!
java.lang.RuntimeException: Could not retrieve endpoint ranges:
at
org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:283)
at
org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:144)
at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:95)
Caused by: org.apache.thrift.transport.TTransportException: Frame size
(352518912) larger than max length (15728640)!
at
org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:137)
at
org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
at
org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at
org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:362)
at
org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:284)
at
org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:191)
at
org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
at
org.apache.cassandra.thrift.Cassandra$Client.recv_describe_partitioner(Cassandra.java:1327)
at
org.apache.cassandra.thrift.Cassandra$Client.describe_partitioner(Cassandra.java:1315)
at
org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:256)
... 2 more

This seems odd since the frame size thrift is asking for is over 336 MB.

This is happening using Cassandra 2.0.12 | Thrift protocol 19.39.0

Any advice?

Thanks!

--Jim


2.1 cassandra 1 node down produces replica shortfall

2019-05-17 Thread Carl Mueller
On one of our largest and unfortunately heaviest multi-tenant clusters,
which is also our last 2.1 prod cluster, we are encountering "not enough
replicas" errors (need 2, only found 1) after bringing down only 1 node. It
is a 90-node cluster on AWS, 30 nodes per DC, with DCs in Europe, Asia, and
the US.

Are there bugs for erroneous gossip state in 2.1.9? I know system.peers and
other issues can make gossip state detection a bit iffy, and AWS also
introduces uncertainty.

Java-driver is v3.7. It is primarily one app throwing the errors, but this
is the app without caching and with substantial query volume. It is also
RF3, while many of the other apps are RF5, which may be contributing.


Re: What happens to empty partitions?

2019-05-17 Thread Carl Mueller
Eventually compaction will remove the row when the sstable is
merged/rewritten.

On Fri, May 17, 2019 at 8:06 AM Tom Vernon 
wrote:

> Hi, I'm having trouble getting my head around what happens to a partition
> that no longer contains any data. As TTL is applied at the column level
> (but not on the primary key), if I insert all values with a TTL then all of
> those values will be tombstoned and eventually purged once they reach that
> TTL. What then happens to that empty partition and key that had no TTL?
> (assuming no more writes will happen to that unique partition key).  Will
> they remain in the keyspace indefinitely? Does this pose any challenges in
> terms of performance/housekeeping?
>
> Thanks
> Tom
>


Re: schema for testing that has a lot of edge cases

2019-06-07 Thread Carl Mueller
Thanks.

On Mon, May 27, 2019 at 12:24 PM Alain RODRIGUEZ  wrote:

> Hello Carl,
>
> What you try to do sounds like a good match with one of the tool we
> open-sourced and actively maintain:
> https://github.com/thelastpickle/tlp-stress.
>
> TLP Stress allows you to use defined profiles (see
> https://github.com/thelastpickle/tlp-stress/tree/master/src/main/kotlin/com/thelastpickle/tlpstress/profiles)
> or create your own profiles and/or schemas. Contributions are welcome. You
> can tune workloads, the read/write ratio, the number of distinct
> partitions, number of operations to run...
>
> You might need multiple client to maximize the throughput, depending on
> instances in use and your own testing goals.
>
> version specific stuff to 2.1, 2.2, 3.x, 4.x
>
>
> In case that might be of some use as well, we like to use it combined with
> another of our tools: TLP Cluster (
> https://github.com/thelastpickle/tlp-cluster). We can the easily create
> and destroy Cassandra environments (on AWS) including Cassandra servers,
> client and monitoring (Prometheus).
>
> You can have a look anyway, I think both projects might be of interest to
> reach your goal.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> Le jeu. 23 mai 2019 à 21:25, Carl Mueller
>  a écrit :
>
>> Does anyone have any schema / schema generation that can be used for
>> general testing that has lots of complicated aspects and data?
>>
>> For example, it has a bunch of different rk/ck variations, column data
>> types, altered /added columns and data (which can impact sstables and
>> compaction),
>>
>> Mischeivous data to prepopulate (such as
>> https://github.com/minimaxir/big-list-of-naughty-strings for strings,
>> ugly keys in maps, semi-evil column names) of sufficient size to get on
>> most nodes of a 3-5 node cluster
>>
>> superwide rows
>> large key values
>>
>> version specific stuff to 2.1, 2.2, 3.x, 4.x
>>
>> I'd be happy to centralize this in a github if this doesn't exist
>> anywhere yet
>>
>>
>>


postmortem on 2.2.13 scale out difficulties

2019-06-11 Thread Carl Mueller
We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS, IPV6

Needed to scale out the asia datacenter, which was 5 nodes, europe and us
were 25 nodes

We were running into bootstrapping issues where the new node failed to
bootstrap/stream; it failed with

"java.lang.RuntimeException: A node required to move the data consistently
is down"

...even though they were all up based on nodetool status prior to adding
the node.

First we increased the phi_convict_threshold to 12, and that did not help.

CASSANDRA-12281 appeared similar to what we had problems with, but I don't
think we hit that. Somewhere in there someone wrote

"For us, the workaround is either deleting the data (then bootstrap again),
or increasing the ring_delay_ms. And the larger the cluster is, the longer
ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster, it
requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
>100seconds. Default is 30seconds."

Given the WAN nature of our DCs, we set ring_delay_ms to 100 seconds and
it finally worked.
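
(One way to set it is via the JVM system property on the joining node, e.g.
in cassandra-env.sh; the value below matches the 100 seconds above.)

# cassandra-env.sh on the bootstrapping node
JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=100000"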

side note:

During the rolling restarts for setting phi_convict_threshold we observed
quite a lot of status-map variance between nodes (we have a program to poll
a whole datacenter's or cluster's view of gossipinfo and statuses). AWS
appears to have networking variance, going by the usual phi_convict_threshold
advice; I'm not sure whether our difficulties were typical in that regard
and/or whether our IPv6 and globally distributed datacenters were
exacerbating factors.

We could not reproduce this in loadtest, although loadtest is only eu and
us (but is IPV6)


Re: upgrade pinning to v3 protocol: massive drop in writes

2019-06-25 Thread Carl Mueller
Oh we are 2.2.13 currently, seems to be 3.7.1 for the java-driver

On Tue, Jun 25, 2019 at 1:11 PM Carl Mueller 
wrote:

> We have an app that needs to be pinned to v3 protocol for the upgrade to
> 3.11.X
>
> ... we rolled out the v3 "pinning" and the amount of write counts and
> network traffice plummeted by 60-90%. The app seems to be functioning
> properly.
>
> has anyone seen anything like this? Could it be "Custom Payloads" in v4?
>


Re: upgrade pinning to v3 protocol: massive drop in writes

2019-06-25 Thread Carl Mueller
Never mind, it would appear that once we looked further back in the metrics
there was some huge bump about a month ago relative to the levels we see now.


On Tue, Jun 25, 2019 at 1:35 PM Carl Mueller 
wrote:

> Oh we are 2.2.13 currently, seems to be 3.7.1 for the java-driver
>
> On Tue, Jun 25, 2019 at 1:11 PM Carl Mueller 
> wrote:
>
>> We have an app that needs to be pinned to v3 protocol for the upgrade to
>> 3.11.X
>>
>> ... we rolled out the v3 "pinning" and the amount of write counts and
>> network traffice plummeted by 60-90%. The app seems to be functioning
>> properly.
>>
>> has anyone seen anything like this? Could it be "Custom Payloads" in v4?
>>
>


upgrade pinning to v3 protocol: massive drop in writes

2019-06-25 Thread Carl Mueller
We have an app that needs to be pinned to v3 protocol for the upgrade to
3.11.X

... we rolled out the v3 "pinning" and the write counts and
network traffic plummeted by 60-90%. The app seems to be functioning
properly.

has anyone seen anything like this? Could it be "Custom Payloads" in v4?


Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
And once the cluster token map formation is done, it starts bootstrap and
we get a ton of these:

WARN  [MessagingService-Incoming-/2406:da14:95b:4503:910e:23fd:dafa:9983]
2019-06-12 15:22:04,760 IncomingTcpConnection.java:100 -
UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=df425400-c331-11e8-8b96-4b7f4d58af68

And then after LOTS of those

INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
Starting to bootstrap...
INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
#05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
Bootstrap
INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
completed! for the tokens [-7314981925085449175, ... bunch of tokens...
5499447097629838103]


On Wed, Jun 12, 2019 at 12:07 PM Carl Mueller 
wrote:

> One node at a time: yes that is what we are doing
>
> We have not tried the streaming_socket_timeout_in_ms. It is currently 24
> hours (```streaming_socket_timeout_in_ms=86400000```), which would cover
> the bootstrap timeframe we have seen before (1-2 hours per node)
>
> Since it joins with no data, it is serving erroneous data. We may try
> nodetool bootstrap resume and the suggested JVM_OPTS option. The node
> appears to think it has bootstrapped even though gossipinfo shows the new
> node has a different schema version.
>
> We had scaled EU and US from 5 --> 25 without incident (one at a time);
> after we increased ring_delay_ms it worked haphazardly enough to get us
> four joins, and since then, failure.
>
> The debug log shows:
>
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 -
> Not pulling schema because versions match or shouldPullSchemaFrom returned
> false
> INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 -
> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
> INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 -
> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 -
> Not pulling schema because versions match or shouldPullSchemaFrom returned
> false
> INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node
> /2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 -
> Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token
> [-1028768087263234868, .., 921670352349030554]
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1028768087263234868
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1045740236536355596
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1184422937682103096
> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
> -1201924032068728250
>
> All the nodes appear to be reporting "Not pulling schema because versions
> match or shouldPullSchemaFrom returned false". That code
> (MigrationManager.java) makes reference to a "gossip only" node; did we get
> stuck in that somehow?
>
> On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A  wrote:
>
>>
>>
>>
>>
>> Adding one node at a time – is that successful?
>>
>> Check value of streaming_socket_timeout_in_ms parameter in cassandra.yaml
>> and increase if needed.
>>
>> Have you tried Nodetool bootstrap resume & jvm option i.e.
>> JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"  ?
>>
>>
>>
>>
>>
>> *From:* Carl Mueller [mailto:carl.muel...@smartthings.com.INVALID]
>> *Sent:* Wednesday, June 12, 2019 11:35 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: postmortem on 2.2.13 scale out difficulties
>>
>>
>>
>> We only were able to scale out four nodes and then failures started
>> occurring, including multiple instances of nodes joining a cluster without
>> streaming.
>>
>>
>>
>> Sigh.
>>
>>
>>
>> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller <
>> carl.muel

Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
We only were able to scale out four nodes and then failures started
occurring, including multiple instances of nodes joining a cluster without
streaming.

Sigh.

On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
wrote:

> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS,
> IPV6
>
> Needed to scale out the asia datacenter, which was 5 nodes, europe and us
> were 25 nodes
>
> We were running into bootstrapping issues where the new node failed to
> bootstrap/stream, it failed with
>
> "java.lang.RuntimeException: A node required to move the data consistently
> is down"
>
> ...even though they were all up based on nodetool status prior to adding
> the node.
>
> First we increased the phi_convict_threshold to 12, and that did not help.
>
> CASSANDRA-12281 appeared similar to what we had problems with, but I don't
> think we hit that. Somewhere in there someone wrote
>
> "For us, the workaround is either deleting the data (then bootstrap
> again), or increasing the ring_delay_ms. And the larger the cluster is, the
> longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster,
> it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
> >100seconds. Default is 30seconds."
>
> Given the WAN nature of our DCs, we used ring_delay_ms to 100 seconds and
> it finally worked.
>
> side note:
>
> During the rolling restarts for setting phi_convict_threshold we observed
> quite a lot of status map variance between nodes (we have a program to poll
> all of a datacenter or cluster's view of the gossipinfo and statuses. AWS
> appears to have variance in networking based on the phi_convict_threshold
> advice, I'm not sure if our difficulties were typical in that regard and/or
> if our IPV6 and/or globally distributed datacenters were exacerbating
> factors.
>
> We could not reproduce this in loadtest, although loadtest is only eu and
> us (but is IPV6)
>


Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
One node at a time: yes that is what we are doing

We have not tried the streaming_socket_timeout_in_ms. It is currently 24
hours (```streaming_socket_timeout_in_ms=86400000```), which would cover
the bootstrap timeframe we have seen before (1-2 hours per node)

Since it joins with no data, it is serving erroneous data. We may try
nodetool bootstrap resume and the suggested JVM_OPTS option. The node
appears to think it has bootstrapped even though gossipinfo shows the new
node has a different schema version.

We had scaled EU and US from 5 --> 25 without incident (one at a time);
after we increased ring_delay_ms it worked haphazardly enough to get us four
joins, and since then, failure.

The debug log shows:

DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false
INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 -
Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 -
Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false
INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node
/2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 -
Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token
[-1028768087263234868, .., 921670352349030554]
DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1028768087263234868
DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1045740236536355596
DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1184422937682103096
DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
-1201924032068728250

All the nodes appear to be reporting "Not pulling schema because versions
match or shouldPullSchemaFrom returned false". That code
(MigrationManager.java) makes reference to a "gossip only" node; did we get
stuck in that somehow?

On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A  wrote:

>
>
>
>
> Adding one node at a time – is that successful?
>
> Check value of streaming_socket_timeout_in_ms parameter in cassandra.yaml
> and increase if needed.
>
> Have you tried Nodetool bootstrap resume & jvm option i.e.
> JVM_OPTS="$JVM_OPTS -Dcassandra.consistent.rangemovement=false"  ?
>
>
>
>
>
> *From:* Carl Mueller [mailto:carl.muel...@smartthings.com.INVALID]
> *Sent:* Wednesday, June 12, 2019 11:35 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: postmortem on 2.2.13 scale out difficulties
>
>
>
> We only were able to scale out four nodes and then failures started
> occurring, including multiple instances of nodes joining a cluster without
> streaming.
>
>
>
> Sigh.
>
>
>
> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
> wrote:
>
> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS,
> IPV6
>
> Needed to scale out the asia datacenter, which was 5 nodes, europe and us
> were 25 nodes
>
> We were running into bootstrapping issues where the new node failed to
> bootstrap/stream, it failed with
>
>
>
> "java.lang.RuntimeException: A node required to move the data consistently
> is down"
>
>
>
> ...even though they were all up based on nodetool status prior to adding
> the node.
>
> First we increased the phi_convict_threshold to 12, and that did not help.
>
> CASSANDRA-12281 appeared similar to what we had problems with, but I don't
> think we hit that. Somewhere in there someone wrote
>
>
>
> "For us, the workaround is either deleting the data (then bootstrap
> again), or increasing the ring_delay_ms. And the larger the cluster is, the
> longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster,
> it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
> >100seconds. Default is 30seconds."
>
> Given the WAN nature of our DCs, we used ring_delay_ms to 100 seconds and
> it finally worked.
>
> side note:
>
> During the rolling restarts for setti

Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
We're getting

DEBUG [GossipStage:1] 2019-06-12 15:20:07,797 MigrationManager.java:96 -
Not pulling schema because versions match or shouldPullSchemaFrom returned
false

multiple times, as it contacts the nodes.

On Wed, Jun 12, 2019 at 11:35 AM Carl Mueller 
wrote:

> We only were able to scale out four nodes and then failures started
> occurring, including multiple instances of nodes joining a cluster without
> streaming.
>
> Sigh.
>
> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller 
> wrote:
>
>> We had a three-DC (asia-tokyo/europe/us) cassandra 2.2.13 cluster, AWS,
>> IPV6
>>
>> Needed to scale out the asia datacenter, which was 5 nodes, europe and us
>> were 25 nodes
>>
>> We were running into bootstrapping issues where the new node failed to
>> bootstrap/stream, it failed with
>>
>> "java.lang.RuntimeException: A node required to move the data
>> consistently is down"
>>
>> ...even though they were all up based on nodetool status prior to adding
>> the node.
>>
>> First we increased the phi_convict_threshold to 12, and that did not
>> help.
>>
>> CASSANDRA-12281 appeared similar to what we had problems with, but I
>> don't think we hit that. Somewhere in there someone wrote
>>
>> "For us, the workaround is either deleting the data (then bootstrap
>> again), or increasing the ring_delay_ms. And the larger the cluster is, the
>> longer ring_delay_ms is needed. Based on our tests, for a 40 nodes cluster,
>> it requires ring_delay_ms to be >50seconds. For a 70 nodes cluster,
>> >100seconds. Default is 30seconds."
>>
>> Given the WAN nature of our DCs, we used ring_delay_ms to 100 seconds and
>> it finally worked.
>>
>> side note:
>>
>> During the rolling restarts for setting phi_convict_threshold we observed
>> quite a lot of status map variance between nodes (we have a program to poll
>> all of a datacenter or cluster's view of the gossipinfo and statuses. AWS
>> appears to have variance in networking based on the phi_convict_threshold
>> advice, I'm not sure if our difficulties were typical in that regard and/or
>> if our IPV6 and/or globally distributed datacenters were exacerbating
>> factors.
>>
>> We could not reproduce this in loadtest, although loadtest is only eu and
>> us (but is IPV6)
>>
>


Re: postmortem on 2.2.13 scale out difficulties

2019-06-12 Thread Carl Mueller
I posted a bug, CASSANDRA-15155:
https://issues.apache.org/jira/browse/CASSANDRA-15155?jql=project%20%3D%20CASSANDRA

It seems VERY similar to
https://issues.apache.org/jira/browse/CASSANDRA-6648

On Wed, Jun 12, 2019 at 12:14 PM Carl Mueller 
wrote:

> And once the cluster token map formation is done, it starts bootstrap and
> we get a ton of these:
>
> WARN  [MessagingService-Incoming-/2406:da14:95b:4503:910e:23fd:dafa:9983]
> 2019-06-12 15:22:04,760 IncomingTcpConnection.java:100 -
> UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
> cfId=df425400-c331-11e8-8b96-4b7f4d58af68
>
> And then after LOTS of those
>
> INFO  [main] 2019-06-12 15:23:25,515 StorageService.java:1142 - JOINING:
> Starting to bootstrap...
> INFO  [main] 2019-06-12 15:23:25,525 StreamResultFuture.java:87 - [Stream
> #05af9ee0-8d26-11e9-85c1-bd5476090c54] Executing streaming plan for
> Bootstrap
> INFO  [main] 2019-06-12 15:23:25,526 StorageService.java:1199 - Bootstrap
> completed! for the tokens [-7314981925085449175, ... bunch of tokens...
> 5499447097629838103]
>
>
> On Wed, Jun 12, 2019 at 12:07 PM Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> One node at a time: yes that is what we are doing
>>
>> We have not tried the streaming_socket_timeout_in_ms. It is currently 24
>> hours. (```streaming_socket_timeout_in_ms=8640```) which would cover
>> the bootstrap timeframe we have seen before (1-2 hours per node)
>>
>> Since it joins with no data, it is serving erroneous data. We may try
>> bootstrap rejoin and the JVM_OPT The node appears to think it has
>> bootstrapped even though the gossipinfo shows the new node has a different
>> schema version.
>>
>> We had scaled EU and US from 5 --> 25 without incident (one at a time),
>> and since we increased ring_delay_ms worked haphazardly to get us four
>> joins, and since then failure.
>>
>> The debug log shows:
>>
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
>> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 9200286188287490229
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,559 StorageService.java:1998 -
>> New node /2a05:d018:af:1108:86f4:d628:6bca:6983 at token 950856676715905899
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,563 MigrationManager.java:96 -
>> Not pulling schema because versions match or shouldPullSchemaFrom returned
>> false
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,563 TokenMetadata.java:464 -
>> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,564 TokenMetadata.java:464 -
>> Updating topology for /2a05:d018:af:1108:86f4:d628:6bca:6983
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,565 MigrationManager.java:96 -
>> Not pulling schema because versions match or shouldPullSchemaFrom returned
>> false
>> INFO  [GossipStage:1] 2019-06-12 15:20:08,565 Gossiper.java:1027 - Node
>> /2600:1f18:4b4:5903:64af:955e:b65:8d83 is now part of the cluster
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,587 StorageService.java:1928 -
>> Node /2600:1f18:4b4:5903:64af:955e:b65:8d83 state NORMAL, token
>> [-1028768087263234868, .., 921670352349030554]
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1028768087263234868
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,588 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1045740236536355596
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1184422937682103096
>> DEBUG [GossipStage:1] 2019-06-12 15:20:08,589 StorageService.java:1998 -
>> New node /2600:1f18:4b4:5903:64af:955e:b65:8d83 at token
>> -1201924032068728250
>>
>> All the nodes appear to be reporting "Not pulling schema because versions
>> match or shouldPullSchemaFrom returned false". That code
>> (MigrationManager.java) makes reference to a "gossip only" node; did we get
>> stuck in that somehow?
>>
>> On Wed, Jun 12, 2019 at 11:45 AM ZAIDI, ASAD A  wrote:
>>
>>>
>>>
>>>
>>>
>>> Adding one node at a time – is that successful?
>>>
>>> Check value of streaming_socket_timeout_in_ms parameter in
>>> cassandra.yaml and increase if needed.
>>>
>>> Have you tried Nodetool bootstrap resume & jvm option i.e.
>>> JVM_OPTS="$JVM_OPTS -Dcassandra.consist

schema for testing that has a lot of edge cases

2019-05-23 Thread Carl Mueller
Does anyone have any schema / schema generation that can be used for
general testing that has lots of complicated aspects and data?

For example: a bunch of different rk/ck variations, column data types,
altered/added columns and data (which can impact sstables and compaction),

Mischievous data to prepopulate (such as
https://github.com/minimaxir/big-list-of-naughty-strings for strings, ugly
keys in maps, semi-evil column names) of sufficient size to get on most
nodes of a 3-5 node cluster

superwide rows
large key values

version specific stuff to 2.1, 2.2, 3.x, 4.x

I'd be happy to centralize this in a github if this doesn't exist anywhere
yet
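
To be concrete, I mean a repo of things roughly like this, just with many
more variations (keyspace/table names here are made up):

cqlsh "$CASSANDRA_HOST" <<'EOF'
CREATE KEYSPACE IF NOT EXISTS edge_cases
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- composite partition key, multi-part clustering key, collections,
-- and a case-sensitive quoted column name
CREATE TABLE IF NOT EXISTS edge_cases.kitchen_sink (
  pk1 text, pk2 uuid,
  ck1 timestamp, ck2 int, ck3 text,
  "CaseSensitive Column" text,
  tags set<text>, attrs map<text, text>,
  payload blob,
  PRIMARY KEY ((pk1, pk2), ck1, ck2, ck3)
) WITH CLUSTERING ORDER BY (ck1 DESC, ck2 ASC, ck3 ASC);
EOF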


2019 manual deletion of sstables

2019-05-07 Thread Carl Mueller
The last time I googled this, some people were doing it back in the 2.0.x
days: you could do it if you brought a node down, removed the desired
sstables' artifacts (Data/Index/etc.), and then started back up, probably
also with a clearing of the saved caches.

A decent-ish amount of data (256G) in a 2.1 cluster we are trying to
upgrade has about 60-70% of the data that could be purged.

The data has only partition keys (no column key) and is only written once.
So the sstables that are expired don't have data leaking across other
sstables.

So can we do this:

1) bring down node
2) remove an sstable with obviously old data (we use sstablemetadata tools
to doublecheck)
3) clear saved caches
4) start back up

And then repair afterward?

The table is STCS. We are trying to avoid writing a purge program and
prompting a full compaction.
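
Per node, the sketch I have in mind looks roughly like this (paths, service
name, and the 2.1-style sstable file name are ours; sstablemetadata output
differs a bit by version):

# 1) stop the node cleanly
nodetool drain && sudo service cassandra stop

# 2) double-check the candidate generation really is old data, then move
#    (not delete) every component of that generation out of the data dir
SST=/var/lib/cassandra/data/myks/mytable-*/myks-mytable-ka-1234
sstablemetadata ${SST}-Data.db | grep -i timestamp
mkdir -p /backup/pruned && mv ${SST}-* /backup/pruned/

# 3) clear saved caches
rm -f /var/lib/cassandra/saved_caches/*

# 4) start back up, then repair the table afterwards
sudo service cassandra start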


Re: 2019 manual deletion of sstables

2019-05-07 Thread Carl Mueller
(repair would be done after all the nodes with obviously deletable sstables
were deleted)
(we may then do a purge program anyway)
(this would seem to get rid of 60-90% of the purgable data without
incurring a big round of tombstones and compaction)

On Tue, May 7, 2019 at 12:05 PM Carl Mueller 
wrote:

> Last my googling had some people doing this back in 2.0.x days, and that
> you could do it if you brought a node down, removed the desired sstable
> #'s  artifacts (Data/Index/etc), and then started up. Probably also with a
> clearing of the saved caches.
>
> A decent-ish amount of data (256G) in a 2.1 cluster we are trying to
> upgrade has about 60-70% of the data that could be purged.
>
> The data has only partition keys (no column key) and is only written once.
> So the sstables that are expired don't have data leaking across other
> sstables.
>
> So can we do this:
>
> 1) bring down node
> 2) remove an sstable with obviously old data (we use sstablemetadata tools
> to doublecheck)
> 3) clear saved caches
> 4) start back up
>
> And then repair afterward?
>
> The table is STCS. We are trying to avoid writing a purge program and
> prompting a full compaction.
>


Re: Cassandra taking very long to start and server under heavy load

2019-05-07 Thread Carl Mueller
You may have run into the same behavior we encountered going from 2.1 -->
2.2 a week or so ago.

We also have multiple data dirs. Hm.

In our case, we will purge the data of the big offending table.

How big are your nodes?

On Tue, May 7, 2019 at 1:40 AM Evgeny Inberg  wrote:

> Still no resolution for this. Did anyone else encounter same behavior?
>
> On Thu, May 2, 2019 at 1:54 PM Evgeny Inberg  wrote:
>
>> Yes, sstable upgraded on each node.
>>
>> On Thu, 2 May 2019, 13:39 Nick Hatfield 
>> wrote:
>>
>>> Just curious but, did you make sure to run the sstable upgrade after you
>>> completed the move from 2.x to 3.x ?
>>>
>>>
>>>
>>> *From:* Evgeny Inberg [mailto:evg...@gmail.com]
>>> *Sent:* Thursday, May 02, 2019 1:31 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: Cassandra taking very long to start and server under
>>> heavy load
>>>
>>>
>>>
>>> Using a single data disk.
>>>
>>> Also, it is performing mostly heavy read operations according to the
>>> metrics collected.
>>>
>>> On Wed, 1 May 2019, 20:14 Jeff Jirsa  wrote:
>>>
>>> Do you have multiple data disks?
>>>
>>> Cassandra 6696 changed behavior with multiple data disks to make it
>>> safer in the situation that one disk fails . It may be copying data to the
>>> right places on startup, can you see if sstables are being moved on disk?
>>>
>>> --
>>>
>>> Jeff Jirsa
>>>
>>>
>>>
>>>
>>> On May 1, 2019, at 6:04 AM, Evgeny Inberg  wrote:
>>>
>>> I have upgraded a Cassandra cluster from version 2.0.x to 3.11.4 going
>>> trough 2.1.14.
>>>
>>> After the upgrade, noticed that each node is taking about 10-15 minutes
>>> to start, and server is under a very heavy load.
>>>
>>> Did some digging around and got view leads from the debug log.
>>>
>>> Messages like:
>>>
>>> *Keyspace.java:351 - New replication settings for keyspace system_auth -
>>> invalidating disk boundary caches *
>>>
>>> *CompactionStrategyManager.java:380 - Recreating compaction strategy -
>>> disk boundaries are out of date for system_auth.roles.*
>>>
>>>
>>>
>>> This is repeating for all keyspaces.
>>>
>>>
>>>
>>> Any suggestion to check and what might cause this to happen on every
>>> start?
>>>
>>>
>>>
>>> Thanks!
>>>
>>>


Re: TWCS: 2.2 ring expand massive overstream 100000 sstables

2019-07-10 Thread Carl Mueller
So I looked at the code, and the compaction selection seems to concentrate
on the newer buckets, finding them "most interesting" for compaction.

Our problem is that we have a huge number of fragmented sstables in buckets
that are a few days old (not yet expired; our expiration is 7 days), so the
sstable selection algorithm doesn't find those particularly
"interesting".

Perhaps we should have something that tries to stabilize the sstable count
across buckets, maybe with some configurable thresholds for deciding what
to prioritize

So even though we opened the floodgates on the compaction throughput and
compactors on the nodes with elevated sstable count, they are still
basically working on the newer/incoming data.

We will probably wait for the 7 days and hope all those fragmented tables
then get nuked.

We could use jmx black magic to force merging, but I know TWCS has metadata
on the sstables identifying their bucket, and I'm not sure if manually
forcing compaction would disrupt that metadata.

We will vertically scale AWS instances if we need to in the short run; we
have stabilized the sstable counts on the nodes with elevated levels, and we
shall see if things return to normal in three or four more days when the
fragments expire.
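
For completeness, "opened the floodgates" is just this per node;
concurrent_compactors itself is a cassandra.yaml setting and, as far as I
know on 2.2, needs a restart to change (table name below is ours):

# 0 = unthrottled compaction throughput
nodetool setcompactionthroughput 0
nodetool getcompactionthroughput

# watch the backlog and the per-table sstable count
nodetool compactionstats
nodetool cfstats myks.mytable | grep -i 'sstable count'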



On Tue, Jul 9, 2019 at 11:12 AM Carl Mueller 
wrote:

>  The existing 15 node cluster had about 450-500GB/node, most in one TWCS
> table. Data is applied with a 7-day TTL. Our cluster couldn't be expanded
> due to a bit of political foot dragging and new load of about 2x-3x started
> up around the time we started expanding.
>
> about 500 sstables per node, with one outlier of 16,000 files (.Data.db to
> be clear).
>
> The 16,000 Data.db sstable files grew steadily from 500 over a week.
> Probably compaction fell behind that was exacerbated by growing load, but
> the sstable count growth appears to have started before the heaviest load
> increases.
>
> We attempted to expand figuring the cluster was under duress. The first
> addition still had 150,000 files/25,000 Data.db files, and about 500 GB
>
> three other nodes have started to gain in number of files as well.
>
> Our last attempted expand filled a 2 terabyte disk and we ended up with
> over 100,000 Data.db sstable files and 600,000 files overall, and it hadn't
> finished. We killed that node.
>
> Wide rows do not appear to be a problem.
>
> We are vertically scaling our nodes to bigger hardware and unthrottling
> compaction and doubling compactors on the nodes that are starting to
> inflate numbers of sstables, that appears to be helping.
>
> But the overstreaming is still a mystery.
>
> Table compaction settings:
>
> ) WITH bloom_filter_fp_chance = 0.01
> AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
> AND comment = ''
> AND compaction = {'compaction_window_unit': 'HOURS',
> 'compaction_window_size': '4', 'class':
> 'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy'}
> AND compression = {'sstable_compression':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 0
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99.0PERCENTILE';
>
>
>
>
>
>
>


TWCS: 2.2 ring expand massive overstream 100000 sstables

2019-07-09 Thread Carl Mueller
 The existing 15 node cluster had about 450-500GB/node, most in one TWCS
table. Data is applied with a 7-day TTL. Our cluster couldn't be expanded
due to a bit of political foot dragging and new load of about 2x-3x started
up around the time we started expanding.

about 500 sstables per node, with one outlier of 16,000 files (.Data.db to
be clear).

The 16,000 Data.db sstable files grew steadily from 500 over a week.
Compaction probably fell behind, exacerbated by the growing load, but
the sstable count growth appears to have started before the heaviest load
increases.

We attempted to expand figuring the cluster was under duress. The first
addition still had 150,000 files/25,000 Data.db files, and about 500 GB

three other nodes have started to gain in number of files as well.

Our last attempted expand filled a 2 terabyte disk and we ended up with
over 100,000 Data.db sstable files and 600,000 files overall, and it hadn't
finished. We killed that node.

Wide rows do not appear to be a problem.

We are vertically scaling our nodes to bigger hardware and unthrottling
compaction and doubling compactors on the nodes that are starting to
inflate numbers of sstables, that appears to be helping.

But the overstreaming is still a mystery.

Table compaction settings:

) WITH bloom_filter_fp_chance = 0.01
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'compaction_window_unit': 'HOURS',
'compaction_window_size': '4', 'class':
'com.jeffjirsa.cassandra.db.compaction.TimeWindowCompactionStrategy'}
AND compression = {'sstable_compression':
'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 0
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';


2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

2019-04-17 Thread Carl Mueller
We are doing a ton of upgrades to get out of 2.1.x. We've done probably
20-30 clusters so far and have not encountered anything like this yet.

After upgrade of a node, the restart takes a long time: around 10 minutes.
Almost all of our other nodes took less than 2 minutes to restart after
upgrade (aside from sstableupgrades).

The startup stalls on a particular table, it is the largest table at about
300GB, but we have upgraded other clusters with about that much data
without this 8-10 minute delay. We have the ability to roll back the node,
and the restart as a 2.1.x node is normal with no delays.

Alas this is a prod cluster so we are going to try to sstable load the data
on a lower environment and try to replicate the delay. If we can, we will
turn on debug logging.

This occurred on the first node we tried to upgrade. It is possible it is
limited to only this node, but we are gunshy to play around with upgrades
in prod.

We have an automated upgrading program that flushes, snapshots, shuts down
gossip, drains before upgrade, suppresses autostart on upgrade, and has
worked about as flawlessly as one could hope for so far for 2.1->2.2 and
2.2-> 3.11 upgrades.

INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
Initializing .access_token
INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
Initializing .refresh_token
INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
Initializing .userid
INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
Initializing .access_token_by_auth

You can see the 6:30 delay in the startup log above. All the other
keyspace/tables initialize in under a second.
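
If we do reproduce it in the lower environment, the debug logging plan is
just the usual (the package name here is a guess at what's relevant, not a
recommendation):

# at runtime, once the node is up
nodetool setlogginglevel org.apache.cassandra.db DEBUG

# for the startup window itself (before nodetool is usable), we'd flip the
# org.apache.cassandra logger to DEBUG in conf/logback.xml and revert after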


Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

2019-04-17 Thread Carl Mueller
Oh, the table in question is SizeTiered, had about 10 sstables total, it
was JBOD across two data directories.

On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller 
wrote:

> We are doing a ton of upgrades to get out of 2.1.x. We've done probably
> 20-30 clusters so far and have not encountered anything like this yet.
>
> After upgrade of a node, the restart takes a long time. like 10 minutes
> long. ALmost all of our other nodes took less than 2 minutes to upgrade
> (aside from sstableupgrades).
>
> The startup stalls on a particular table, it is the largest table at about
> 300GB, but we have upgraded other clusters with about that much data
> without this 8-10 minute delay. We have the ability to roll back the node,
> and the restart as a 2.1.x node is normal with no delays.
>
> Alas this is a prod cluster so we are going to try to sstable load the
> data on a lower environment and try to replicate the delay. If we can, we
> will turn on debug logging.
>
> This occurred on the first node we tried to upgrade. It is possible it is
> limited to only this node, but we are gunshy to play around with upgrades
> in prod.
>
> We have an automated upgrading program that flushes, snapshots, shuts down
> gossip, drains before upgrade, suppressed autostart on upgrade, and has
> worked about as flawlessly as one could hope for so far for 2.1->2.2 and
> 2.2-> 3.11 upgrades.
>
> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 -
> Initializing .access_token
> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 -
> Initializing .refresh_token
> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 -
> Initializing .userid
> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 -
> Initializing .access_token_by_auth
>
> You can see the 6:30 delay in the startup log above. All the other
> keyspace/tables initialize in under a second.
>
>
>


Re: TWCS generates large numbers of sstables on only some nodes

2019-07-16 Thread Carl Mueller
The pending compaction count stays consistently in the 40-60 range, but only
recent sstables are being compacted.

What I fear is that TWCS, when it hits a certain compaction threshold, keeps
compacting the same sstables, adding a slice of the most recently flushed
data each time, and falls behind.

I'd rather it compacted fragments of sstable files from the same bucket
together rather than constantly appending to the same sstable.

But that assumption is based on a superficial examination of the compactor
code.


On Tue, Jul 16, 2019 at 12:47 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Jul 15, 2019 at 6:20 PM Carl Mueller
>  wrote:
>
>> Related to our overstreaming, we have a cluster of about 25 nodes, with
>> most at about 1000 sstable files (Data + others).
>>
>> And about four that are at 20,000 - 30,000 sstable files (Data+Index+etc).
>>
>> We have vertically scaled the outlier machines and turned off compaction
>> throttling thinking it was compaction that couldn't keep up. That
>> stabilized the growth, but the sstable count is not going down.
>>
>> The TWCS code seems to highly bias towards "recent" tables for
>> compaction. We figured we'd boost the throughput/compactors and that would
>> solve the more recent ones, and the older ones would fall off. But the
>> number of sstables has remained high on a daily basis on the couple "bad
>> nodes".
>>
>> Is this simply a lack of sufficient compaction throughput? Is there
>> something in TWCS that would force frequent flushing more than normal?
>>
>
> What does nodetool compactionstats says about pending compaction tasks on
> the affected nodes with the high number of files?
>
> Regards,
> --
> Alex
>
>


Impact of a large number of components in column key/cluster key

2019-08-06 Thread Carl Mueller
Say there are one vs. three vs. five vs. eight parts of a column key.

Will range slicing slow down the more parts there are? Will compactions be
impacted?


TWCS generates large numbers of sstables on only some nodes

2019-07-15 Thread Carl Mueller
Related to our overstreaming, we have a cluster of about 25 nodes, with
most at about 1000 sstable files (Data + others).

And about four that are at 20,000 - 30,000 sstable files (Data+Index+etc).

We have vertically scaled the outlier machines and turned off compaction
throttling thinking it was compaction that couldn't keep up. That
stabilized the growth, but the sstable count is not going down.

The TWCS code seems to highly bias towards "recent" tables for compaction.
We figured we'd boost the throughput/compactors and that would solve the
more recent ones, and the older ones would fall off. But the number of
sstables has remained high on a daily basis on the couple "bad nodes".

Is this simply a lack of sufficient compaction throughput? Is there
something in TWCS that would force frequent flushing more than normal?
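
For what it's worth, "some nodes" is measured bluntly, roughly like this
(host list, data path, and table directory glob are ours):

# Data.db count per node for the offending table
for h in $(cat cluster_hosts.txt); do
  n=$(ssh "$h" 'find /var/lib/cassandra/data/myks/mytable-* -name "*Data.db" | wc -l')
  echo "$h $n"
done | sort -k2 -n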


Re: Breaking up major compacted Sstable with TWCS

2019-07-15 Thread Carl Mueller
Does sstablesplit properly restore the time buckets of the data? It appears
to be size-based only.

On Fri, Jul 12, 2019 at 5:55 AM Rhys Campbell
 wrote:

>
> https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/tools/toolsSStables/toolsSSTableSplit.html
>
> Leon Zaruvinsky  schrieb am Fr., 12. Juli 2019,
> 00:06:
>
>> Hi,
>>
>> We are switching a table to run using TWCS. However, after running the
>> alter statement, we ran a major compaction without understanding the
>> implications.
>>
>> Now, while new sstables are properly being created according to the time
>> window, there is a giant sstable sitting around waiting for expiration.
>>
>> Is there a way we can break it up again?  Running the alter statement
>> again doesn’t seem to be touching it.
>>
>> Thanks,
>> Leon
>>
>>
>>


Re: upgrading from 2.x TWCS to 3.x TWCS

2019-09-27 Thread Carl Mueller
Or can we just do this safely in a side jar?

package com.jeffjirsa.cassandra.db.compaction;

import org.apache.cassandra.db.ColumnFamilyStore;

import java.util.Map;

public class TimeWindowCompactionStrategy extends
        org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy {
    public TimeWindowCompactionStrategy(ColumnFamilyStore cfs,
                                        Map<String, String> options) {
        super(cfs, options);
    }
}


On Fri, Sep 27, 2019 at 12:29 PM Carl Mueller 
wrote:

> So IF that delegate class would work:
>
> 1) create jar with the delegate class
> 2) deploy jar along with upgrade on node
> 3) once all nodes are upgraded, issue ALTER to change to the
> org.apache.cassandra TWCS class.
>
> will that trigger full recompaction?
>
> On Fri, Sep 27, 2019 at 12:25 PM Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> For example (delegating all public methods from
>> AbstractCompactionStrategy and some from TimeWindowCompactionStrategy
>> EXCEPT getName() )
>>
>> package com.jeffjirsa.cassandra.db.compaction;
>>
>> public class TimeWindowCompactionStrategy extends
>> AbstractCompactionStrategy
>> {
>> org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy
>> delegate;
>>
>> public TimeWindowCompactionStrategy(ColumnFamilyStore cfs,
>> Map options) {
>> delegate = new
>> org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy(cfs,options);
>> }
>>
>>
>> public Directories getDirectories() { return
>> delegate.getDirectories(); }
>> public synchronized void pause() { delegate.pause(); }
>> public synchronized void resume() { delegate.resume(); }
>> public synchronized void startup() { delegate.startup(); }
>> public synchronized void shutdown() { delegate.shutdown(); }
>>
>>
>>
>> public AbstractCompactionTask getNextBackgroundTask(int gcBefore) {
>> return delegate.getNextBackgroundTask(gcBefore); }
>> public synchronized Collection
>> getMaximalTask(int gcBefore, boolean splitOutput) { return
>> delegate.getMaximalTask(gcBefore, splitOutput); }
>> public synchronized AbstractCompactionTask
>> getUserDefinedTask(Collection sstables, int gcBefore) {
>> return getUserDefinedTask(sstables,gcBefore); }
>> public int getEstimatedRemainingTasks() { return
>> delegate.getEstimatedRemainingTasks(); }
>> public long getMaxSSTableBytes() { return
>> delegate.getMaxSSTableBytes(); }
>> public ScannerList getScanners(Collection sstables,
>> Range range) { return delegate.getScanners(sstables,range); }
>> public ScannerList getScanners(Collection sstables,
>> Collection> ranges) { return delegate.getScanners(sstables,
>> ranges);}
>> public ScannerList getScanners(Collection toCompact) {
>> return delegate.getScanners(toCompact); }
>> protected boolean worthDroppingTombstones(SSTableReader sstable, int
>> gcBefore) { return delegate.worthDroppingTombstones(sstable,gcBefore); }
>>
>> public boolean shouldDefragment() { return
>> delegate.shouldDefragment(); }
>> public synchronized void replaceSSTables(Collection
>> removed, Collection added) {
>> delegate.replaceSSTables(removed,added); }
>> public Collection>
>> groupSSTablesForAntiCompaction(Collection sstablesToGroup) {
>> delegate.groupSSTablesForAntiCOmpaction(sstablesToGroup); }
>> public CompactionLogger.Strategy strategyLogger() { return
>> delegate.strategyLogger(); }
>> public SSTableMultiWriter createSSTableMultiWriter(Descriptor
>> descriptor,
>>long keyCount,
>>long repairedAt,
>>UUID pendingRepair,
>>boolean
>> isTransient,
>>MetadataCollector
>> meta,
>>
>>  SerializationHeader header,
>>Collection
>> indexes,
>>
>>  LifecycleNewTracker lifecycleNewTracker) {
>> return delefate.createSSTableMultiWriter(descriptor, keyCount,
>> repairedAt, pendingRepair, isTransient, meta, header, indexes,
>> lifecycleNewTracker);
>> }
>>
>> public boolean supportsEarlyOpen() { return
>> delegate.supportsEarlyOpen(); }
>>
>>
>>
>>
>> public String getName() { return getClass().getSimpleName(); } //
>> don't delegate beca

Re: upgrading from 2.x TWCS to 3.x TWCS

2019-09-27 Thread Carl Mueller
Is this still the official answer on TWCS 2.X --> 3.X upgrades? Pull the
code and recompile as a different package?

Can I just declare the necessary class and package namespace and delegate
to the actual main-codebase class?


On Mon, Nov 5, 2018 at 1:41 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Sat, Nov 3, 2018 at 1:13 AM Brian Spindler 
> wrote:
>
>> That wasn't horrible at all.  After testing, provided all goes well I can
>> submit this back to the main TWCS repo if you think it's worth it.
>>
>> Either way do you mind just reviewing briefly for obvious mistakes?
>>
>>
>> https://github.com/bspindler/twcs/commit/7ba388dbf41b1c9dc1b70661ad69273b258139da
>>
>
> About almost a year ago we were migrating from 2.1 to 3.0 and we figured
> out that Jeff's master branch didn't compile with 3.0, but the change to
> get it running was really minimal:
>
> https://github.com/a1exsh/twcs/commit/10ee91c6f409aa249c8d439f7670d8b997ab0869
>
> So we built that jar, added it to the packaged 3.0 and we were good to
> go.  Maybe you might want to consider migrating in two steps: 2.1 -> 3.0,
> ALTER TABLE, upgradesstables, 3.0 -> 3.1.
>
> And huge thanks to Jeff for coming up with TWCS in the first place! :-)
>
> Cheers,
> --
> Alex
>
>


Re: upgrading from 2.x TWCS to 3.x TWCS

2019-09-27 Thread Carl Mueller
So IF that delegate class would work:

1) create jar with the delegate class
2) deploy jar along with upgrade on node
3) once all nodes are upgraded, issue ALTER to change to the
org.apache.cassandra TWCS class.

will that trigger full recompaction?
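
To spell out step 3, I'm picturing something like this once every node is
upgraded and still has the jar on its classpath (keyspace/table and window
settings here are ours):

cqlsh -e "ALTER TABLE myks.mytable WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': '4'
};"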

On Fri, Sep 27, 2019 at 12:25 PM Carl Mueller 
wrote:

> For example (delegating all public methods from AbstractCompactionStrategy
> and some from TimeWindowCompactionStrategy EXCEPT getName() )
>
> package com.jeffjirsa.cassandra.db.compaction;
>
> public class TimeWindowCompactionStrategy extends
> AbstractCompactionStrategy
> {
> org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy
> delegate;
>
> public TimeWindowCompactionStrategy(ColumnFamilyStore cfs,
> Map options) {
> delegate = new
> org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy(cfs,options);
> }
>
>
> public Directories getDirectories() { return
> delegate.getDirectories(); }
> public synchronized void pause() { delegate.pause(); }
> public synchronized void resume() { delegate.resume(); }
> public synchronized void startup() { delegate.startup(); }
> public synchronized void shutdown() { delegate.shutdown(); }
>
>
>
> public AbstractCompactionTask getNextBackgroundTask(int gcBefore) {
> return delegate.getNextBackgroundTask(gcBefore); }
> public synchronized Collection
> getMaximalTask(int gcBefore, boolean splitOutput) { return
> delegate.getMaximalTask(gcBefore, splitOutput); }
> public synchronized AbstractCompactionTask
> getUserDefinedTask(Collection sstables, int gcBefore) {
> return getUserDefinedTask(sstables,gcBefore); }
> public int getEstimatedRemainingTasks() { return
> delegate.getEstimatedRemainingTasks(); }
> public long getMaxSSTableBytes() { return
> delegate.getMaxSSTableBytes(); }
> public ScannerList getScanners(Collection sstables,
> Range range) { return delegate.getScanners(sstables,range); }
> public ScannerList getScanners(Collection sstables,
> Collection> ranges) { return delegate.getScanners(sstables,
> ranges);}
> public ScannerList getScanners(Collection toCompact) {
> return delegate.getScanners(toCompact); }
> protected boolean worthDroppingTombstones(SSTableReader sstable, int
> gcBefore) { return delegate.worthDroppingTombstones(sstable,gcBefore); }
>
> public boolean shouldDefragment() { return
> delegate.shouldDefragment(); }
> public synchronized void replaceSSTables(Collection
> removed, Collection added) {
> delegate.replaceSSTables(removed,added); }
> public Collection>
> groupSSTablesForAntiCompaction(Collection sstablesToGroup) {
> delegate.groupSSTablesForAntiCOmpaction(sstablesToGroup); }
> public CompactionLogger.Strategy strategyLogger() { return
> delegate.strategyLogger(); }
> public SSTableMultiWriter createSSTableMultiWriter(Descriptor
> descriptor,
>long keyCount,
>long repairedAt,
>UUID pendingRepair,
>boolean isTransient,
>MetadataCollector
> meta,
>SerializationHeader
> header,
>Collection
> indexes,
>LifecycleNewTracker
> lifecycleNewTracker) {
> return delefate.createSSTableMultiWriter(descriptor, keyCount,
> repairedAt, pendingRepair, isTransient, meta, header, indexes,
> lifecycleNewTracker);
> }
>
> public boolean supportsEarlyOpen() { return
> delegate.supportsEarlyOpen(); }
>
>
>
>
> public String getName() { return getClass().getSimpleName(); } //
> don't delegate because this is probably the name used in the column family
> definition.
>
>
>
>
> public synchronized void addSSTable(SSTableReader sstable) {
> delegate.addSSTable(sstable); }
> public synchronized void addSSTables(Iterable added) {
> delegate.addSSTables(added); }
>
> public synchronized void removeSSTable(SSTableReader sstable) {
> delegate.removeSSTable(sstable); }
> public void removeSSTables(Iterable removed) {
> delegate.removeSSTables(removed); }
>
> public void metadataChanged(StatsMetadata oldMetadata, SSTableReader
> sstable) { delegate.metadataChanges(oldMetadata,sstable); }
>
>
> protected Set getSSTables() { return
> delegate.getSSTables(); }
> public String toString() {return delegate.toString(); }
>
>
> public 

Re: upgrading from 2.x TWCS to 3.x TWCS

2019-09-27 Thread Carl Mueller
Oops, I think the getName() is an important thing:

package com.jeffjirsa.cassandra.db.compaction;

import org.apache.cassandra.db.ColumnFamilyStore;

import java.util.Map;

public class TimeWindowCompactionStrategy extends
org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy {
public TimeWindowCompactionStrategy(ColumnFamilyStore cfs,
                                    Map<String, String> options) {
    super(cfs, options);
}

public String getName()
{
return getClass().getSimpleName();
}
}


On Fri, Sep 27, 2019 at 1:05 PM Carl Mueller 
wrote:

> Or can we just do this safely in a side jar?
>
> package com.jeffjirsa.cassandra.db.compaction;
>
> import org.apache.cassandra.db.ColumnFamilyStore;
> import org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy;
>
> import java.util.Map;
>
> public class TimeWindowCompactionStrategy extends 
> TimeWindowCompactionStrategy {
> public TimeWindowCompactionStrategy(ColumnFamilyStore cfs, Map String> options) {
> super(cfs,options);
> }
> }
>
>
> On Fri, Sep 27, 2019 at 12:29 PM Carl Mueller <
> carl.muel...@smartthings.com> wrote:
>
>> So IF that delegate class would work:
>>
>> 1) create jar with the delegate class
>> 2) deploy jar along with upgrade on node
>> 3) once all nodes are upgraded, issue ALTER to change to the
>> org.apache.cassandra TWCS class.
>>
>> will that trigger full recompaction?
>>
>> On Fri, Sep 27, 2019 at 12:25 PM Carl Mueller <
>> carl.muel...@smartthings.com> wrote:
>>
>>> For example (delegating all public methods from
>>> AbstractCompactionStrategy and some from TimeWindowCompactionStrategy
>>> EXCEPT getName() )
>>>
>>> package com.jeffjirsa.cassandra.db.compaction;
>>>
>>> public class TimeWindowCompactionStrategy extends
>>> AbstractCompactionStrategy
>>> {
>>> org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy
>>> delegate;
>>>
>>> public TimeWindowCompactionStrategy(ColumnFamilyStore cfs,
>>> Map options) {
>>> delegate = new
>>> org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy(cfs,options);
>>> }
>>>
>>>
>>> public Directories getDirectories() { return
>>> delegate.getDirectories(); }
>>> public synchronized void pause() { delegate.pause(); }
>>> public synchronized void resume() { delegate.resume(); }
>>> public synchronized void startup() { delegate.startup(); }
>>> public synchronized void shutdown() { delegate.shutdown(); }
>>>
>>>
>>>
>>> public AbstractCompactionTask getNextBackgroundTask(int gcBefore) {
>>> return delegate.getNextBackgroundTask(gcBefore); }
>>> public synchronized Collection
>>> getMaximalTask(int gcBefore, boolean splitOutput) { return
>>> delegate.getMaximalTask(gcBefore, splitOutput); }
>>> public synchronized AbstractCompactionTask
>>> getUserDefinedTask(Collection sstables, int gcBefore) {
>>> return getUserDefinedTask(sstables,gcBefore); }
>>> public int getEstimatedRemainingTasks() { return
>>> delegate.getEstimatedRemainingTasks(); }
>>> public long getMaxSSTableBytes() { return
>>> delegate.getMaxSSTableBytes(); }
>>> public ScannerList getScanners(Collection sstables,
>>> Range range) { return delegate.getScanners(sstables,range); }
>>> public ScannerList getScanners(Collection sstables,
>>> Collection> ranges) { return delegate.getScanners(sstables,
>>> ranges);}
>>> public ScannerList getScanners(Collection toCompact)
>>> { return delegate.getScanners(toCompact); }
>>> protected boolean worthDroppingTombstones(SSTableReader sstable, int
>>> gcBefore) { return delegate.worthDroppingTombstones(sstable,gcBefore); }
>>>
>>> public boolean shouldDefragment() { return
>>> delegate.shouldDefragment(); }
>>> public synchronized void replaceSSTables(Collection
>>> removed, Collection added) {
>>> delegate.replaceSSTables(removed,added); }
>>> public Collection>
>>> groupSSTablesForAntiCompaction(Collection sstablesToGroup) {
>>> delegate.groupSSTablesForAntiCOmpaction(sstablesToGroup); }
>>> public CompactionLogger.Strategy strategyLogger() { return
>>> delegate.strategyLogger(); }
>>> public SSTableMultiWriter createSSTableMultiWriter(Descriptor
>>> descriptor,
>>>long keyCount,
>>>   

Re: upgrading from 2.x TWCS to 3.x TWCS

2019-09-27 Thread Carl Mueller
For example (delegating all public methods from AbstractCompactionStrategy
and some from TimeWindowCompactionStrategy EXCEPT getName() )

package com.jeffjirsa.cassandra.db.compaction;

public class TimeWindowCompactionStrategy extends AbstractCompactionStrategy
{
    org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy delegate;

    public TimeWindowCompactionStrategy(ColumnFamilyStore cfs, Map<String, String> options) {
        delegate = new org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy(cfs, options);
    }

    public Directories getDirectories() { return delegate.getDirectories(); }
    public synchronized void pause() { delegate.pause(); }
    public synchronized void resume() { delegate.resume(); }
    public synchronized void startup() { delegate.startup(); }
    public synchronized void shutdown() { delegate.shutdown(); }

    public AbstractCompactionTask getNextBackgroundTask(int gcBefore) { return delegate.getNextBackgroundTask(gcBefore); }
    public synchronized Collection<AbstractCompactionTask> getMaximalTask(int gcBefore, boolean splitOutput) { return delegate.getMaximalTask(gcBefore, splitOutput); }
    public synchronized AbstractCompactionTask getUserDefinedTask(Collection<SSTableReader> sstables, int gcBefore) { return delegate.getUserDefinedTask(sstables, gcBefore); }
    public int getEstimatedRemainingTasks() { return delegate.getEstimatedRemainingTasks(); }
    public long getMaxSSTableBytes() { return delegate.getMaxSSTableBytes(); }
    public ScannerList getScanners(Collection<SSTableReader> sstables, Range<Token> range) { return delegate.getScanners(sstables, range); }
    public ScannerList getScanners(Collection<SSTableReader> sstables, Collection<Range<Token>> ranges) { return delegate.getScanners(sstables, ranges); }
    public ScannerList getScanners(Collection<SSTableReader> toCompact) { return delegate.getScanners(toCompact); }
    protected boolean worthDroppingTombstones(SSTableReader sstable, int gcBefore) { return delegate.worthDroppingTombstones(sstable, gcBefore); }

    public boolean shouldDefragment() { return delegate.shouldDefragment(); }
    public synchronized void replaceSSTables(Collection<SSTableReader> removed, Collection<SSTableReader> added) { delegate.replaceSSTables(removed, added); }
    public Collection<Collection<SSTableReader>> groupSSTablesForAntiCompaction(Collection<SSTableReader> sstablesToGroup) { return delegate.groupSSTablesForAntiCompaction(sstablesToGroup); }
    public CompactionLogger.Strategy strategyLogger() { return delegate.strategyLogger(); }
    public SSTableMultiWriter createSSTableMultiWriter(Descriptor descriptor,
                                                       long keyCount,
                                                       long repairedAt,
                                                       UUID pendingRepair,
                                                       boolean isTransient,
                                                       MetadataCollector meta,
                                                       SerializationHeader header,
                                                       Collection<Index> indexes,
                                                       LifecycleNewTracker lifecycleNewTracker) {
        return delegate.createSSTableMultiWriter(descriptor, keyCount, repairedAt, pendingRepair, isTransient, meta, header, indexes, lifecycleNewTracker);
    }

    public boolean supportsEarlyOpen() { return delegate.supportsEarlyOpen(); }

    public String getName() { return getClass().getSimpleName(); } // don't delegate because this is probably the name used in the column family definition.

    public synchronized void addSSTable(SSTableReader sstable) { delegate.addSSTable(sstable); }
    public synchronized void addSSTables(Iterable<SSTableReader> added) { delegate.addSSTables(added); }

    public synchronized void removeSSTable(SSTableReader sstable) { delegate.removeSSTable(sstable); }
    public void removeSSTables(Iterable<SSTableReader> removed) { delegate.removeSSTables(removed); }

    public void metadataChanged(StatsMetadata oldMetadata, SSTableReader sstable) { delegate.metadataChanged(oldMetadata, sstable); }

    protected Set<SSTableReader> getSSTables() { return delegate.getSSTables(); }
    public String toString() { return delegate.toString(); }

    public static Map<String, String> validateOptions(Map<String, String> options) throws ConfigurationException {
        return org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy.validateOptions(options);
    }
}

On Fri, Sep 27, 2019 at 11:58 AM Carl Mueller 
wrote:

> Is this still the official answer on TWCS 2.X --> 3.X upgrades? Pull the
> code and recompile as a different package?
>
> Can I just declare the necessary class and package namespace and delegate
> to the actual main-codebase class?
>
>
> On Mon, Nov 5, 2018 at 1:41 AM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Sat, Nov 3, 2018 at 1:13 AM Brian Spindler 
>> wrote:
>>
>>> That wasn't horrible at all.  After testing, provided all goes well I
>>> can submit this back to the main TWCS repo if you think it's worth it.
>>>
>>> Either way do 

Re: upgrading from 2.x TWCS to 3.x TWCS

2019-09-30 Thread Carl Mueller
sweet

Ugh, for the test I just realized that sstableloader probably won't produce
buckets corresponding with the actual insertion time of the data in TWCS.

Well, we can still run the test.

On Mon, Sep 30, 2019 at 2:47 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Fri, Sep 27, 2019 at 7:39 PM Carl Mueller
>  wrote:
>
>> So IF that delegate class would work:
>>
>> 1) create jar with the delegate class
>> 2) deploy jar along with upgrade on node
>> 3) once all nodes are upgraded, issue ALTER to change to the
>> org.apache.cassandra TWCS class.
>>
>
> Yes, this used to work for us in the same situation.
>
> will that trigger full recompaction?
>>
>
> Nope. :-)
>
> --
> Alex
>
>


TWCS: what happens on node replacement/streaming

2019-07-06 Thread Carl Mueller
TWCS distributes its data by time buckets/flushes.

But on node add/streaming, it doesn't have the natural ordering provided by
the timing of the incoming update streams.

So does TWCS properly reconstruct buckets on streaming/replacement?
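
A crude way I know to eyeball which window an sstable would land in is the
max timestamp recorded in its metadata (paths are ours; the values are
normally epoch microseconds):

for f in /var/lib/cassandra/data/myks/mytable-*/*Data.db; do
  echo "== $f =="
  sstablemetadata "$f" | grep -iE 'minimum timestamp|maximum timestamp'
done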


Re: TWCS: what happens on node replacement/streaming

2019-07-06 Thread Carl Mueller
Thanks Jeff

On Sat, Jul 6, 2019 at 12:16 PM Jeff Jirsa  wrote:

> The max timestamp for each sstable is in the metadata on each sstable, so
> on streaming of any kind (bootstrap, repair, etc) sstables are added to
> their corrrect and expected windows.
>
>
>
> > On Jul 6, 2019, at 10:09 AM, Carl Mueller 
> > 
> wrote:
> >
> > TWCS distributes it data by time buckets/flushes
> >
> > But on node add/streaming, it doesn't have the natural ordering provided
> by the timing of the incoming update sterams.
> >
> > So does TWCS properly reconsturct buckets on streaming/replacement?
>
>
>


Re: AWS ephemeral instances + backup

2019-12-09 Thread Carl Mueller
Jeff: the gp2 drives are expensive, especially if you have to make them
unnecessarily large to get the IOPS, and I want to get as cheap per node as
possible to get as many nodes as possible.

i3 + a cheap rust backup beats an m5 or similar + EBS gp2 in cost, when I
did the numbers.

Ben: Going to S3 would be even cheaper and probably about the same speed. I
think I was avoiding it because of the network cost and throttling concerns,
but if it is cheap enough vs the rust EBS then I'll do that. I think I came
across your page during my earlier research.

Jon: I have my own thing that is very similar to Medusa but supports our
various wonky modes of access (bastions, IPv6, etc.). Very similar, with
comparable incremental backups and the like. The backups run at scheduled
times, but my rewrite would enable a more local strategy by watching the
sstable dirs. Medusa's restore modes are better in some respects, but I
can do more complicated things too. I'm trying to abstract the access mode
(k8s/ssh/etc.), the cloud, and even the tech (Kafka/Cassandra) in a rewrite,
and it is damn hard to avoid abstraction leakage.

Reid: possibly we could, but the EBS snapshot has to move the 100GBs every
time, while sstable copies/incremental backups only move the new files, so
shipping just the raw new bits is faster and more resilient.
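
A minimal sketch of that "only ship the new files" idea -- watch the table's data directory and copy each newly created sstable component to the backup volume. The paths are placeholders, and real tooling would also have to handle deletions, tmp files, and consistency of the component set:

import java.io.IOException;
import java.nio.file.*;

import static java.nio.file.StandardWatchEventKinds.ENTRY_CREATE;

public class SSTableDirWatcher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Path dataDir = Paths.get("/var/lib/cassandra/data/my_ks/my_table-abc123");  // placeholder
        Path backupDir = Paths.get("/mnt/ebs-backup/my_ks/my_table-abc123");        // placeholder
        Files.createDirectories(backupDir);

        WatchService watcher = FileSystems.getDefault().newWatchService();
        dataDir.register(watcher, ENTRY_CREATE);

        while (true) {
            WatchKey key = watcher.take();
            for (WatchEvent<?> event : key.pollEvents()) {
                Path created = dataDir.resolve((Path) event.context());
                // Only ship finished sstable components; skip in-progress tmp files.
                if (!created.getFileName().toString().startsWith("tmp")) {
                    Files.copy(created, backupDir.resolve(created.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
            key.reset();
        }
    }
}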

Thank you everyone, at least with all you bigwigs giving advice I can argue
from appeal to authority to management :-) (which is always more effective
than arguing from reason or evidence)


On Fri, Dec 6, 2019 at 9:18 AM Reid Pinchback 
wrote:

> Correction:  “most of your database will be in chunk cache, or buffer
> cache anyways.”
>
>
>
> *From: *Reid Pinchback 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Friday, December 6, 2019 at 10:16 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: AWS ephemeral instances + backup
>
>
>
> *Message from External Sender*
>
> If you’re only going to have a small storage footprint per node like
> 100gb, another option comes to mind. Use an instance type with large ram.
> Use an EBS storage volume on an EBS-optimized instance type, and take EBS
> snapshots. Most of your database will be in chunk cache anyways, so you
> only need to make sure that the dirty background writer is keeping up.  I’d
> take a look at iowait during a snapshot and see if the results are
> acceptable for a running node.  Even if it is marginal, if you’re only
> snapshotting one node at a time, then speculative retry would just skip
> over the temporary slowpoke.
>
>
>
> *From: *Carl Mueller 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Thursday, December 5, 2019 at 3:21 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *AWS ephemeral instances + backup
>
>
>
> *Message from External Sender*
>
> Does anyone have experience tooling written to support this strategy:
>
> Use case: run cassandra on i3 instances on ephemerals but synchronize the
> sstables and commitlog files to the cheapest EBS volume type (those have
> bad IOPS but decent enough throughput)
>
> On node replace, the startup script for the node back-copies the sstables
> and commitlog state from the EBS volume to the ephemeral.
>
> As can be seen:
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
>
> the (presumably) spinning rust tops out at 2375 MB/sec (using multiple EBS
> volumes), which would incur roughly a ten-minute delay for node
> replacement on a 1TB node. But I imagine this would only be used on higher
> IOPS r/w nodes with smaller densities, so 100GB would be only about a minute
> of delay, already within the timeframes of an AWS node
> replacement/instance restart.
>
>
>


Re: Seeing tons of DigestMismatchException exceptions after upgrading from 2.2.13 to 3.11.4

2019-12-09 Thread Carl Mueller
My speculation on rapidly churning/fast reads of recently written data:

- data is written at QUORUM (RF=3): the write is confirmed after two replicas reply
- the data is read very soon afterward (possibly a code antipattern), and let's assume
the third replica's update hasn't completed yet (e.g. AWS network "variance").
The read will pick a replica, and then there is roughly a 50% chance the second
replica chosen for the quorum read is the stale node, which triggers a
DigestMismatch read repair.

Is that plausible?
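
A quick back-of-the-envelope check of that scenario -- assuming RF=3, exactly one momentarily stale replica, and a coordinator that picks two of the three replicas uniformly at random for the QUORUM read (all assumptions, not measured behavior):

import java.util.Random;

public class StaleReplicaOdds {
    public static void main(String[] args) {
        Random rng = new Random();
        int trials = 1_000_000, mismatches = 0;
        for (int i = 0; i < trials; i++) {
            int stale = rng.nextInt(3);             // which of the 3 replicas is behind
            int first = rng.nextInt(3);             // replica serving the data read
            int second = rng.nextInt(3);            // replica serving the digest read
            while (second == first) second = rng.nextInt(3);
            if (first == stale || second == stale) mismatches++;
        }
        // Prints roughly 0.67: under these assumptions the stale replica is part of
        // the quorum read about two times in three, and each such read can surface
        // as a DigestMismatch read repair.
        System.out.printf("fraction of quorum reads touching the stale replica: %.2f%n",
                          (double) mismatches / trials);
    }
}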

The code seems to log the exception for every digest-mismatch read repair, so it
doesn't seem to be an ERROR with red blaring klaxons; maybe it should be a
WARN?

On Mon, Nov 25, 2019 at 11:12 AM Colleen Velo  wrote:

> Hello,
>
> As part of the final stages of our 2.2 --> 3.11 upgrades, one of our
> clusters (on AWS/ 18 nodes/ m4.2xlarge) produced some post-upgrade fits. We
> started getting spikes of Cassandra read and write timeouts despite the
> fact that the overall metrics volumes were unchanged. As part of the upgrade
> process, there was a TWCS table for which we used a facade implementation to
> change the namespace of the compaction class, but that table has very low
> query volume.
>
> The DigestMismatchException error messages, (based on sampling the hash
> keys and finding which tables have partitions for that hash key), seem to
> be occurring on the heaviest volume table (4,000 reads, 1600 writes per
> second per node approximately), and that table has semi-medium row widths
> with about 10-40 column keys. (Or at least the digest mismatch partitions
> have that type of width). The keyspace is an RF3 using NetworkTopology, the
> CL is QUORUM for both reads and writes.
>
> We have experienced the DigestMismatchException errors on all 3 of the
> Production clusters that we have upgraded (all of them are single DC in the
> us-east-1/eu-west-1/ap-northeast-2 AWS regions) and in all three cases,
> those DigestMismatchException errors were not there in either the  2.1.x or
> 2.2.x versions of Cassandra.
> Does anyone know of changes from 2.2 to 3.11 that would produce additional
> timeout problems, such as heavier blocking read repair logic?  Also,
>
> We ran repairs (via reaper v1.4.8) (much nicer in 3.11 than 2.1) on all of
> the tables and across all of the nodes, and our timeouts seemed to have
> disappeared, but we continue to see a rapid streaming of the Digest
> mismatches exceptions, so much so that our Cassandra debug logs are rolling
> over every 15 minutes. There is a mailing list post from 2018 that
> indicates that some DigestMismatchException error messages are natural if
> you are reading while writing, but the sheer volume that we are getting is
> very concerning:
>  - https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html
>
> Is that level of DigestMismatchException unusual? Or can that volume of
> mismatches appear if semi-wide rows simply require a lot of resolution
> because flurries of quorum reads/writes (RF3) on recent partitions have a
> decent chance of not having fully synced data on the replica reads? Does
> the digest mismatch error get debug-logged on every chance read repair?
> Also, why are these DigestMismatchException only occurring once the
> upgrade to 3.11 has occurred?
>
>
> ~
>
> Sample DigestMismatchException error message:
> DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448
> ReadCallback.java:242 - Digest mismatch:
> org.apache.cassandra.service.DigestMismatchException: Mismatch for key
> DecoratedKey(-6492169518344121155,
> 66306139353831322d323064382d313037322d663965632d636565663165326563303965)
> (be2c0feaa60d99c388f9d273fdc360f7 vs 09eaded2d69cf2dd49718076edf56b36)
> at
> org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
> ~[apache-cassandra-3.11.4.jar:3.11.4]
> at
> org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233)
> ~[apache-cassandra-3.11.4.jar:3.11.4]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_77]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_77]
> at
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:81)
> [apache-cassandra-3.11.4.jar:3.11.4]
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_77]
>
> Cluster(s) setup:
> * AWS region: eu-west-1:
> — Nodes: 18
> — single DC
> — keyspace: RF3 using NetworkTopology
>
> * AWS region: us-east-1:
> — Nodes: 20
> — single DC
> — keyspace: RF3 using NetworkTopology
>
> * AWS region: ap-northeast-2:
> — Nodes: 30
> — single DC
> — keyspace: RF3 using NetworkTopology
>
> Thanks for any insight into this issue.
>
> --
>
> *Colleen Velo*
> email: 

Re: TTL on UDT

2019-12-09 Thread Carl Mueller
Oh right frozen vs unfrozen.

On Mon, Dec 9, 2019 at 2:23 PM DuyHai Doan  wrote:

> It depends. Recent versions of Cassandra allow unfrozen UDTs. The
> individual fields of an unfrozen UDT are updated atomically and are stored
> effectively as distinct physical columns inside the partition, so
> applying ttl() to them makes sense. I'm not sure, however, whether the CQL
> parser allows this syntax.
>
> On Mon, Dec 9, 2019 at 9:13 PM Carl Mueller
>  wrote:
>
>> I could be wrong, but UDTs I think are written (and overwritten) as one
>> unit, so the notion of a TTL on a UDT field doesn't exist, the TTL is
>> applied to the overall structure.
>>
>> Think of it like a serialized json object with multiple fields. To update
>> a field they deserialize the json, then reserialize the json with the new
>> value, and the whole json object has the new timestamp or ttl.
>>
>> On Tue, Dec 3, 2019 at 10:02 AM Mark Furlong 
>> wrote:
>>
>>> When I run the command ‘select ttl(udt_field) from table;’ I’m getting an
>>> error ‘InvalidRequest: Error from server: code=2200 [Invalid query]
>>> message="Cannot use selection function ttl on collections"’. How can I get
>>> the TTL from a UDT field?
>>>
>>>
>>>
>>> *Mark Furlong*
>>>
>>>
>>>
>>>
>>>
>>> We empower journeys of personal discovery to enrich lives
>>>
>>>
>>>
>>>
>>>
>>


Re: TTL on UDT

2019-12-09 Thread Carl Mueller
I could be wrong, but UDTs I think are written (and overwritten) as one
unit, so the notion of a TTL on a UDT field doesn't exist, the TTL is
applied to the overall structure.

Think of it like a serialized json object with multiple fields. To update a
field they deserialize the json, then reserialize the json with the new
value, and the whole json object has the new timestamp or ttl.
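
To make that concrete, a toy model of the frozen vs. non-frozen difference discussed in this thread -- these are not Cassandra's actual storage classes, just an illustration:

import java.util.HashMap;
import java.util.Map;

public class UdtTtlModel {
    // Frozen UDT: the whole value is serialized into a single cell,
    // so there is exactly one timestamp/TTL for the entire struct.
    static class FrozenUdtCell {
        final byte[] serialized;
        final int ttlSeconds;
        FrozenUdtCell(byte[] serialized, int ttlSeconds) {
            this.serialized = serialized;
            this.ttlSeconds = ttlSeconds;
        }
    }

    // Non-frozen UDT: each field is its own cell and can carry its own TTL,
    // which is why asking for ttl(field) is at least meaningful there.
    static class FieldCell {
        final byte[] value;
        final int ttlSeconds;
        FieldCell(byte[] value, int ttlSeconds) {
            this.value = value;
            this.ttlSeconds = ttlSeconds;
        }
    }

    public static void main(String[] args) {
        FrozenUdtCell frozen = new FrozenUdtCell("{\"a\":1,\"b\":2}".getBytes(), 3600);
        Map<String, FieldCell> nonFrozen = new HashMap<>();
        nonFrozen.put("a", new FieldCell("1".getBytes(), 3600));
        nonFrozen.put("b", new FieldCell("2".getBytes(), 60));   // per-field TTLs can differ
        System.out.println("frozen UDT TTL: " + frozen.ttlSeconds);
        System.out.println("non-frozen field TTLs: a=" + nonFrozen.get("a").ttlSeconds
                           + " b=" + nonFrozen.get("b").ttlSeconds);
    }
}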

On Tue, Dec 3, 2019 at 10:02 AM Mark Furlong  wrote:

> When I run the command ‘select ttl(udt_field) from table;’ I’m getting an
> error ‘InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot use selection function ttl on collections"’. How can I get
> the TTL from a UDT field?
>
>
>
> *Mark Furlong*
>
>
>
>
>
> We empower journeys of personal discovery to enrich lives
>
>
>
>
>


Dynamo autoscaling: does it beat cassandra?

2019-12-09 Thread Carl Mueller
Dynamo salespeople have been pushing autoscaling abilities, which have become
one of the key temptations for our management to switch off of Cassandra.

Has anyone done any numbers on how well Dynamo autoscales through demand
spikes, and how we could architect Cassandra to compete with that ability?

We could probably overprovision and, with the presumably higher cost of
Dynamo, still beat it, although the sales engineers claim they are closing the
cost gap too. We could vertically scale to some degree, but node
expansion seems close.

VNode expansion is still limited to one at a time?

We use vnodes, so we can't do Netflix's cluster doubling, correct? With
Cassandra 4.0's alleged segregation of the data by token we could, though, and
possibly also "prep" the node by having the necessary sstables already present
ahead of time?

There's always "caching" too, but there isn't a lot of data on general
fronting of cassandra with caches, and the row cache continues to be mostly
useless?

