Re: Tombstone removal optimization and question

2018-11-06 Thread kurt greaves
Yes it does. Consider if it didn't and you kept writing to the same
partition, you'd never be able to remove any tombstones for that partition.

On Tue., 6 Nov. 2018, 19:40 DuyHai Doan wrote:

> Hello all
>
> I have tried to sum up all rules related to tombstone removal:
>
>
> --
>
> Given a tombstone written at timestamp (t) for a partition key (P) in
> SSTable (S1). This tombstone will be removed:
>
> 1) after gc_grace_seconds period has passed
> 2) at the next compaction round, if SSTable S1 is selected (not at all
> guaranteed because compaction is not deterministic)
> 3) if the partition key (P) is not present in any other SSTable that is
> NOT picked by the current round of compaction
>
> Rule 3) is quite complex to understand so here is the detailed explanation:
>
> If partition key (P) also exists in another SSTable (S2) that is NOT
> compacted together with SSTable (S1), removing the tombstone could let
> some data in S2 resurrect.
>
> More precisely, at compaction time Cassandra does not have ANY detail about
> partition (P)'s data in S2, so it cannot remove the tombstone right away.
>
> Now, for each SSTable, we have some metadata, namely minTimestamp and
> maxTimestamp.
>
> I wonder whether the current compaction logic leverages this metadata for
> tombstone removal. Indeed, if we know that the tombstone timestamp
> (t) < minTimestamp of S2, the tombstone can be safely removed.
>
> Does someone have the info?
>
> Regards
>
>
>
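[Editor's note] The rules above, including the proposed minTimestamp check, can be sketched in a few lines. This is an illustrative model only, not Cassandra's actual compaction code; all names are made up:

```python
import time

def can_drop_tombstone(tombstone_ts, deletion_time, gc_grace_seconds,
                       overlapping_min_timestamps, now=None):
    """Illustrative check: can a partition tombstone in the SSTable
    being compacted be dropped?

    overlapping_min_timestamps: minTimestamp of every SSTable that
    contains the same partition key (P) but is NOT part of this
    compaction round (the S2 case from the rules above).
    """
    now = time.time() if now is None else now
    # Rule 1: gc_grace_seconds must have elapsed since the deletion.
    if now - deletion_time < gc_grace_seconds:
        return False
    # Rule 3 + metadata optimization: a tombstone only shadows data
    # written at or before its timestamp (t). If every overlapping
    # SSTable holds strictly newer data (t < minTimestamp), nothing
    # outside this compaction can resurrect, so the drop is safe.
    return all(tombstone_ts < min_ts for min_ts in overlapping_min_timestamps)

# Partition (P) also lives in an SSTable whose data is all newer than
# the tombstone, so the tombstone is droppable once gc_grace has passed:
assert can_drop_tombstone(100, deletion_time=0, gc_grace_seconds=10,
                          overlapping_min_timestamps=[150], now=1000)
```

Kurt's "Yes it does" above refers to exactly this kind of overlap check during compaction.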


[ANNOUNCE] Stratio's Lucene plugin fork

2018-10-18 Thread kurt greaves
Hi all,

We've had confirmation from Stratio that they are no longer maintaining
their Lucene plugin for Apache Cassandra. We've thus decided to fork the
plugin to continue maintaining it. At this stage we won't be making any
additions to the plugin in the short term unless absolutely necessary, and
as 4.0 nears we'll begin making it compatible with the new major release.
We plan on taking the existing PRs and issues from the Stratio repository
and getting them merged/resolved, however this likely won't happen until
early next year. Having said that, we welcome all contributions and will
dedicate time to reviewing bugs in the current versions if people lodge
them and can help.

I'll note that this is new ground for us; we don't have much existing
knowledge of the plugin, but we are determined to learn. If anyone out there
has established knowledge about the plugin we'd be grateful for any
assistance!

You can find our fork here:
https://github.com/instaclustr/cassandra-lucene-index
At the moment, the only difference is that there is a 3.11.3 branch which
just has some minor changes to dependencies to better support 3.11.3.

Cheers,
Kurt


Re: SSTableMetadata Util

2018-10-01 Thread kurt greaves
Pranay,

3.11.3 should include all the C* binaries in /usr/bin. Maybe try
reinstalling? Sounds like something got messed up along the way.

Kurt

On Tue, 2 Oct 2018 at 12:45, Pranay akula 
wrote:

> Thanks Christophe,
>
> I installed using the rpm package. I actually ran the locate command to find
> the sstable utils and could find only those 4.
>
> Probably I may need to manually copy them.
>
> Regards
> Pranay
>
> On Mon, Oct 1, 2018, 9:01 PM Christophe Schmitz <
> christo...@instaclustr.com> wrote:
>
>> Hi Pranay,
>>
>> The sstablemetadata tool is still available in the tarball
>> ($CASSANDRA_HOME/tools/bin) in 3.11.3. Not sure why it is not available in
>> your packaged installation; you might want to manually copy the one from
>> the tarball to your /usr/bin/
>>
>> Additionally, you can have a look at
>> https://github.com/instaclustr/cassandra-sstable-tools which will
>> provide you with the desired info, plus more info you might find useful.
>>
>>
>> Christophe Schmitz - Instaclustr  -
>> Cassandra | Kafka | Spark Consulting
>>
>>
>>
>>
>>
>> On Tue, 2 Oct 2018 at 11:31 Pranay akula 
>> wrote:
>>
>>> Hi,
>>>
>>> I am testing Apache Cassandra 3.11.3 and couldn't find the sstablemetadata util.
>>>
>>> All I can see are these utilities in /usr/bin:
>>>
>>> -rwxr-xr-x.   1 root root2042 Jul 25 06:12 sstableverify
>>> -rwxr-xr-x.   1 root root2045 Jul 25 06:12 sstableutil
>>> -rwxr-xr-x.   1 root root2042 Jul 25 06:12 sstableupgrade
>>> -rwxr-xr-x.   1 root root2042 Jul 25 06:12 sstablescrub
>>> -rwxr-xr-x.   1 root root2034 Jul 25 06:12 sstableloader
>>>
>>>
>>> If this utility is no longer available, how can I get sstable metadata
>>> like repaired_at and estimated droppable tombstones?
>>>
>>>
>>> Thanks
>>> Pranay
>>>
>>


Re: TWCS + subrange repair = excessive re-compaction?

2018-09-26 Thread kurt greaves
Not any faster, as you'll still have to wait for all the SSTables to age
off, as a partition level tombstone will simply go to a new SSTable and
likely will not be compacted with the old SSTables.
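[Editor's note] A toy model of why the tombstone never meets the old data under TWCS (window size and timestamps are arbitrary illustration values, not real defaults):

```python
WINDOW_SECONDS = 3600  # e.g. a 1-hour TWCS window, for illustration

def compaction_window(write_timestamp):
    # TWCS buckets SSTables by the time window of their data; only
    # SSTables in the same window get compacted together.
    return write_timestamp // WINDOW_SECONDS

old_rows_ts = 1_000        # data written long ago
tombstone_ts = 1_000_000   # partition-level delete issued now

# The tombstone is flushed into the *current* window's SSTable, while
# the rows it shadows sit in long-closed windows, so the two are never
# compacted together; the space comes back only when the old SSTables
# age off (e.g. via TTL expiry).
assert compaction_window(old_rows_ts) != compaction_window(tombstone_ts)
```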

On Tue, 25 Sep 2018 at 17:03, Martin Mačura  wrote:

> Most partitions in our dataset span one or two SSTables at most.  But
> there might be a few that span hundreds of SSTables.  If I located and
> deleted them (partition-level tombstone), would this fix the issue?
>
> Thanks,
>
> Martin
> On Mon, Sep 24, 2018 at 1:08 PM Jeff Jirsa  wrote:
> >
> >
> >
> >
> > On Sep 24, 2018, at 3:47 AM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
> >
> > On Mon, Sep 24, 2018 at 10:50 AM Jeff Jirsa  wrote:
> >>
> >> Do your partitions span time windows?
> >
> >
> > Yes.
> >
> >
> > The data structure used to know if data needs to be streamed (the merkle
> tree) is only granular to - at best - a token, so even with subrange repair
> if a byte is off, it’ll stream the whole partition, including parts of old
> repaired sstables
> >
> > Incremental repair is smart enough not to diff or stream already
> repaired data, but the matrix of which versions allow subrange AND
> incremental repair isn’t something I’ve memorized (I know it behaves the
> way you’d hope in trunk/4.0 after Cassandra-9143)
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
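[Editor's note] Jeff's granularity point can be illustrated: a merkle-tree leaf hashes everything in a token range together, so one stale byte makes the whole leaf mismatch and everything under it streams, including already-repaired partitions. A simplified sketch (not the real merkle implementation):

```python
import hashlib

def leaf_hash(partitions):
    """Hash all rows covered by one merkle-tree leaf (a token range).
    partitions: dict of partition key -> list of row values."""
    h = hashlib.sha256()
    for key in sorted(partitions):
        for row in partitions[key]:
            h.update(row.encode())
    return h.hexdigest()

replica_a = {"p1": ["row1", "row2"], "p2": ["row3"]}
replica_b = {"p1": ["row1", "row2-stale"], "p2": ["row3"]}  # one value off

# The leaves differ, and repair has no finer-grained information, so
# everything under the leaf is streamed -- not just the stale row.
assert leaf_hash(replica_a) != leaf_hash(replica_b)
```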


Re: node replacement failed

2018-09-22 Thread kurt greaves
I don't like your cunning plan. Don't drop the system auth and distributed
keyspaces; instead, just change them to NTS and then do your replacement for
each down node.

If you're actually using auth and are worried about consistency, I believe
3.11 has a feature to exclude nodes during a repair, which you could use
just to repair the auth keyspace.
But if you're not using auth, changing them and then doing all your
replaces is the best method of recovery here.
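[Editor's note] For example, the keyspace change could be scripted like this. A sketch only: the datacenter name "dc1" and RF of 3 below are placeholders; use the DC names reported by your own `nodetool status`:

```python
def alter_to_nts(keyspace, dc_rf):
    """Build the CQL to switch a keyspace to NetworkTopologyStrategy.
    dc_rf maps datacenter name -> replication factor."""
    opts = {"class": "NetworkTopologyStrategy", **dc_rf}
    body = ", ".join(f"'{k}': '{v}'" for k, v in opts.items())
    return f"ALTER KEYSPACE {keyspace} WITH replication = {{{body}}};"

# Hypothetical DC name "dc1" with RF 3:
for ks in ("system_auth", "system_distributed", "system_traces"):
    print(alter_to_nts(ks, {"dc1": 3}))
```

Run the generated statements in cqlsh, then repair the affected keyspaces so the new replicas actually receive the data.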

On Sun., 23 Sep. 2018, 00:33 onmstester onmstester, 
wrote:

> Another question,
> Is there a management tool to do nodetool cleanup one by one (wait until
> finish of cleaning up one node then start clean up for the next node in
> cluster)?
>  On Sat, 22 Sep 2018 16:02:17 +0330 *onmstester onmstester
> >* wrote 
>
> I have a cunning plan (Baldrick wise) to solve this problem:
>
>- stop client application
>- run nodetool flush on all nodes to save memtables to disk
>- stop cassandra on all of the nodes
>- rename original Cassandra data directory to data-old
>- start cassandra on all the nodes to create a fresh cluster including
>the old dead nodes
>- again create the application related keyspaces in cqlsh and this
>time set rf=2 on system keyspaces (to never encounter this problem again!)
>    - move sstables from the data-old dir to current data dirs and restart
>    cassandra or reload sstables
>
>
> Should this work and solve my problem?
>
>
>  On Mon, 10 Sep 2018 17:12:48 +0430 *onmstester onmstester
> >* wrote 
>
>
>
> Thanks Alain,
> First here it is more detail about my cluster:
>
>- 10 racks + 3 nodes on each rack
>- nodetool status: shows 27 nodes UN and 3 nodes all related to single
>rack as DN
>- version 3.11.2
>
> *Option 1: (Change schema and) use replace method (preferred method)*
> * Did you try to have the replace going, without any former repairs,
> ignoring the fact 'system_traces' might be inconsistent? You probably don't
> care about this table, so if Cassandra allows it with some of the nodes
> down, going this way is relatively safe probably. I really do not see what
> you could lose that matters in this table.
> * Another option, if the schema first change was accepted, is to make the
> second one, to drop this table. You can always rebuild it in case you need
> it I assume.
>
> I would really love to let the replace proceed, but it stops with the error:
>
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
>
> Also, I could delete system_traces, which is empty anyway, but there are
> system_auth and system_distributed keyspaces too and they are not empty.
> Could I delete them safely too?
> If I could just somehow skip streaming the system keyspaces during the node
> replace phase, option 1 would be great.
>
> P.S: It's clear to me that I should use at least RF=3 in production, but I
> could not manage to acquire enough resources yet (I hope this will be fixed
> in the near future)
>
> Again Thank you for your time
>
> Sent using Zoho Mail 
>
>
>  On Mon, 10 Sep 2018 16:20:10 +0430 *Alain RODRIGUEZ
> >* wrote 
>
>
>
> Hello,
>
> I am sorry it took us (the community) more than a day to answer to this
> rather critical situation. That being said, my recommendation at this point
> would be for you to make sure about the impacts of whatever you would try.
> Working on a broken cluster, as an emergency might lead you to a second
> mistake, possibly more destructive than the first one. It has happened to me
> and to others, for many clusters. As a global piece of advice, move forward
> even more carefully in these situations.
>
> Suddenly I lost all disks of cassandra-data on one of my racks
>
>
> With RF=2, I guess operations use LOCAL_ONE consistency, thus you should
> have all the data in the safe rack(s) with your configuration, you probably
> did not lose anything yet and have the service only using the nodes up,
> that got the right data.
>
>  tried to replace the nodes with same ip using this:
>
> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>
>
> As a side note, I would recommend you to use 'replace_address_first_boot'
> instead of 'replace_address'. This does basically the same but will be
> ignored after the first bootstrap. A detail, but hey, it's there and
> somewhat safer, I would use this one.
>
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
>
> By default, non-user keyspaces use 'SimpleStrategy' and a small RF.
> Ideally, this should be changed in a production cluster, and you're having
> an example of why.
>
> Now when I altered the system_traces keyspace strategy to
> NetworkTopologyStrategy and RF=2,
> running nodetool repair failed: Endpoint not alive /IP of dead
> node that I'm trying to replace.
>
>
> Changing the replication strategy you made the dead 

Re: stuck with num_tokens 256

2018-09-22 Thread kurt greaves
No, that's not true.

On Sat., 22 Sep. 2018, 21:58 onmstester onmstester, 
wrote:

>
> If you have problems with balance you can add new nodes using the
> algorithm and it'll balance out the cluster. You probably want to stick to
> 256 tokens though.
>
>
> I read somewhere (don't remember the ref) that all nodes of the cluster
> should use the same algorithm, so if my cluster suffers from imbalanced
> nodes using the random algorithm, I cannot add new nodes that use the
> allocation algorithm. Isn't that correct?
>
>
>


Re: stuck with num_tokens 256

2018-09-22 Thread kurt greaves
>
> But one more question, should i use num_tokens : 8 (i would follow
> datastax recommendation) and allocate_tokens_for_local_replication_factor=3
> (which is max RF among my keyspaces) for new clusters which i'm going to
> setup?

16 is probably where it's at. Test beforehand though.

> Is the allocation algorithm now the recommended algorithm, and is it mature
> enough to replace the random algorithm? If so, should it be the default in
> 4.0?

Let's leave that discussion to the other thread on the dev list.

On Sat, 22 Sep 2018 at 20:35, onmstester onmstester 
wrote:

> Thanks,
> Because all my clusters are already balanced, i won't change their config
> But one more question, should i use num_tokens : 8 (i would follow
> datastax recommendation) and allocate_tokens_for_local_replication_factor=3
> (which is max RF among my keyspaces) for new clusters which i'm going to
> setup?
> Is the allocation algorithm now the recommended algorithm, and is it mature
> enough to replace the random algorithm? If so, should it be the default in
> 4.0?
>
>
>  On Sat, 22 Sep 2018 13:41:47 +0330 *kurt greaves
> >* wrote 
>
> If you have problems with balance you can add new nodes using the
> algorithm and it'll balance out the cluster. You probably want to stick to
> 256 tokens though.
> To reduce your # tokens you'll have to do a DC migration (best way). Spin
> up a new DC using the algorithm on the nodes and set a lower number of
> tokens. You'll want to test first but if you create a new keyspace for the
> new DC prior to creation of the new nodes with the desired RF (ie. a
> keyspace just in the "new" DC with your RF) then add your nodes using that
> keyspace for allocation, and tokens *should* be distributed evenly amongst
> that DC; when you migrate you can decommission the old DC and hopefully end
> up with a balanced cluster.
> Definitely test beforehand though because that was just me theorising...
>
> I'll note though that if your existing clusters don't have any major
> issues it's probably not worth the migration at this point.
>
> On Sat, 22 Sep 2018 at 17:40, onmstester onmstester 
> wrote:
>
>
> I noticed that currently there is a discussion in ML with
> subject: changing default token behavior for 4.0.
> Any recommendation to guys like me who already have multiple clusters ( >
> 30 nodes in each cluster) with random partitioner and num_tokens = 256?
> I should also add some nodes to existing clusters, is it possible
> with num_tokens = 256?
> How could we fix this bug (reduce num_tokens in existent clusters)?
> Cassandra version: 3.11.2
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>
>


Re: stuck with num_tokens 256

2018-09-22 Thread kurt greaves
If you have problems with balance you can add new nodes using the algorithm
and it'll balance out the cluster. You probably want to stick to 256 tokens
though.
To reduce your # tokens you'll have to do a DC migration (best way). Spin
up a new DC using the algorithm on the nodes and set a lower number of
tokens. You'll want to test first but if you create a new keyspace for the
new DC prior to creation of the new nodes with the desired RF (ie. a
keyspace just in the "new" DC with your RF) then add your nodes using that
keyspace for allocation, and tokens *should* be distributed evenly amongst that
DC; when you migrate you can decommission the old DC and hopefully end up
with a balanced cluster.
Definitely test beforehand though because that was just me theorising...

I'll note though that if your existing clusters don't have any major issues
it's probably not worth the migration at this point.

On Sat, 22 Sep 2018 at 17:40, onmstester onmstester 
wrote:

> I noticed that currently there is a discussion in ML with
> subject: changing default token behavior for 4.0.
> Any recommendation to guys like me who already have multiple clusters ( >
> 30 nodes in each cluster) with random partitioner and num_tokens = 256?
> I should also add some nodes to existing clusters, is it possible
> with num_tokens = 256?
> How could we fix this bug (reduce num_tokens in existent clusters)?
> Cassandra version: 3.11.2
>
> Sent using Zoho Mail 
>
>
>


Re: Recommended num_tokens setting for small cluster

2018-08-29 Thread kurt greaves
For 10 nodes you probably want to use between 32 and 64. Make sure you use
the token allocation algorithm by specifying allocate_tokens_for_keyspace.
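[Editor's note] The reason a low vnode count needs the allocation algorithm: with purely random tokens, few tokens per node means high ownership variance. A toy single-ring simulation (ignores replication, racks, and the real token space):

```python
import random

def ownership_spread(num_nodes, tokens_per_node, seed=42):
    """Assign random tokens on a [0, 1) ring and return the max/min
    ownership ratio across nodes (1.0 would be perfectly balanced)."""
    random.seed(seed)
    ring = sorted((random.random(), node)
                  for node in range(num_nodes)
                  for _ in range(tokens_per_node))
    owned = [0.0] * num_nodes
    prev = 0.0
    for pos, node in ring:
        owned[node] += pos - prev  # a token owns the range before it
        prev = pos
    owned[ring[0][1]] += 1.0 - prev  # wrap-around range goes to first token
    return max(owned) / min(owned)

# Random assignment only balances out as the token count grows, which
# is why 256 random tokens worked and 8 random tokens would not:
assert ownership_spread(10, 8) > ownership_spread(10, 256)
```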

On Thu., 30 Aug. 2018, 04:40 Jeff Jirsa,  wrote:

> 3.0 has an (optional?) feature to guarantee better distribution, and the
> blog focuses on 2.2.
>
> Using fewer will minimize your risk of unavailability if any two hosts
> fail.
>
> --
> Jeff Jirsa
>
>
> On Aug 29, 2018, at 11:18 AM, Max C.  wrote:
>
> Hello Everyone,
>
> Datastax recommends num_tokens = 8 as a sensible default, rather than
> num_tokens = 256:
>
>
> https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configVnodes.html
>
> … but then I see stories like this (unbalanced cluster when using
> num_tokens=12), which are very concerning:
>
>
> https://danielparker.me/cassandra/vnodes/tokens/increasing-vnodes-cassandra/
>
> We’re currently running 3.0.x, 3 nodes, RF=3, num_tokens=256, spinning
> disks, soon to be 2 DCs.   My guess is that our cluster will probably not
> grow beyond 10 nodes (10 TB?)
>
> I’d like to minimize the chance of hitting a roadblock down the road due
> to having num_tokens set inappropriately.   We can change this right now
> pretty easily (our dataset is small but growing).  Should we switch from
> 256 to 8?  32?
>
> Has anyone had num_tokens = 8 (or similarly small number) and experienced
> growing pains?  What do you think the recommended setting should be?
>
> Thanks for the advice.  :-)
>
> - Max
>
>


Re: URGENT: disable reads from node

2018-08-29 Thread kurt greaves
Note that you'll miss incoming writes if you do that, so you'll be
inconsistent even after the repair. I'd say best to just query at QUORUM
until you can finish repairs.
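[Editor's note] The arithmetic behind that advice: with RF=2, a quorum is both replicas, so every QUORUM read must include the node that still has its data:

```python
def quorum(rf: int) -> int:
    # Cassandra computes quorum as floor(rf / 2) + 1.
    return rf // 2 + 1

# RF=2 -> quorum of 2: a QUORUM read always consults both replicas,
# so it cannot be served by the wiped node alone.
assert quorum(2) == 2
# For comparison, RF=3 tolerates one stale or down replica at QUORUM.
assert quorum(3) == 2
```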

On 29 August 2018 at 21:22, Alexander Dejanovski 
wrote:

> Hi Vlad, you must restart the node but first disable joining the cluster,
> as described in the second part of this blog post :
> http://thelastpickle.com/blog/2018/08/02/Re-Bootstrapping-
> Without-Bootstrapping.html
>
> Once repaired, you'll have to run "nodetool join" to start serving reads.
>
>
> On Wed., 29 Aug. 2018 at 12:40, Vlad wrote:
>
>> Will it help to set read_repair_chance to 1 (compaction is
>> SizeTieredCompactionStrategy)?
>>
>>
>> On Wednesday, August 29, 2018 1:34 PM, Vlad 
>> wrote:
>>
>>
>> Hi,
>>
>> quite urgent questions:
>> due to a disk and C* start problem we were forced to delete the commit logs
>> from one of the nodes.
>>
>> Now repair is running, but meanwhile some reads bring no data (RF=2)
>>
>> Can this node be excluded from read queries, so that all reads will be
>> redirected to the other node in the ring?
>>
>>
>> Thanks to All for help.
>>
>>
>> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>


Re: Nodetool refresh v/s sstableloader

2018-08-29 Thread kurt greaves
Removing dev...
Nodetool refresh only picks up new SSTables that have been placed in the
tables directory. It doesn't account for actual ownership of the data like
SSTableloader does. Refresh will only work properly if the SSTables you are
copying in are completely covered by that node's tokens. It doesn't work if
there's a change in topology; replication and token ownership will have to
be more or less the same.

SSTableloader will break up the SSTables and send the relevant bits to
whichever node needs it, so no need for you to worry about tokens and
copying data to the right places, it will do that for you.
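[Editor's note] A sketch of that difference in terms of token ownership (single-token nodes, RF=1, made-up tokens; real clusters have vnodes and replicas):

```python
def owner(token, ring):
    """ring: sorted (token, node) pairs. A partition belongs to the
    first node whose ring token is >= the partition's token, wrapping
    around at the end of the ring."""
    for t, node in ring:
        if token <= t:
            return node
    return ring[0][1]  # wrap around

ring = [(100, "A"), (200, "B"), (300, "C")]

# nodetool refresh: only correct if every partition in the copied
# SSTables is owned by the local node (here, node B).
sstable_tokens = [120, 180]
assert all(owner(t, ring) == "B" for t in sstable_tokens)

# sstableloader: routes each partition to whichever node owns it,
# so an SSTable with mixed ownership is fine.
assert {t: owner(t, ring) for t in [50, 150, 250]} == \
       {50: "A", 150: "B", 250: "C"}
```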

On 28 August 2018 at 11:27, Rajath Subramanyam  wrote:

> Hi Cassandra users, Cassandra dev,
>
> When recovering using SSTables from a snapshot, I want to know what are
> the key differences between using:
> 1. Nodetool refresh and,
> 2. SSTableloader
>
> Does nodetool refresh have restrictions that need to be met?
> Does nodetool refresh work even if there is a change in the topology
> between the source cluster and the destination cluster? Does it work if the
> token ranges don't match between the source cluster and the destination
> cluster? Does it work when an old SSTable in the snapshot has a dropped
> column that is not part of the current schema?
>
> I appreciate any help in advance.
>
> Thanks,
> Rajath
> 
> Rajath Subramanyam
>
>


Re: Re: bigger data density with Cassandra 4.0?

2018-08-29 Thread kurt greaves
Most of the issues around big nodes are related to streaming, which is
currently quite slow (should be a bit better in 4.0). HBase is built on top
of hadoop, which is much better at large files/very dense nodes, and tends
to be quite average for transactional data. ScyllaDB IDK, I'd assume they
just sorted out streaming by learning from C*'s mistakes.

On 29 August 2018 at 19:43, onmstester onmstester 
wrote:

> Thanks Kurt,
> Actually my cluster has > 10 nodes, so there is a tiny chance to stream a
> complete SSTable.
> Logically, any columnar NoSQL DB like Cassandra always needs to re-sort
> grouped data for fast later reads, and having nodes with a big amount
> of data (> 2 TB) would be annoying for this background process. How is it
> possible that some of these databases, like HBase and ScyllaDB, do not
> emphasize small nodes (like Cassandra does)?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
> ==== Forwarded message 
> From : kurt greaves 
> To : "User"
> Date : Wed, 29 Aug 2018 12:03:47 +0430
> Subject : Re: bigger data density with Cassandra 4.0?
>  Forwarded message 
>
> My reasoning was if you have a small cluster with vnodes you're more
> likely to have enough overlap between nodes that whole SSTables will be
> streamed on major ops. As  N gets >RF you'll have less common ranges and
> thus less likely to be streaming complete SSTables. Correct me if I've
> misunderstood.
>
>
>
>


Re: bigger data density with Cassandra 4.0?

2018-08-29 Thread kurt greaves
My reasoning was if you have a small cluster with vnodes you're more likely
to have enough overlap between nodes that whole SSTables will be streamed
on major ops. As  N gets >RF you'll have less common ranges and thus less
likely to be streaming complete SSTables. Correct me if I've misunderstood.

On 28 August 2018 at 01:37, Dinesh Joshi 
wrote:

> Although the extent of benefits depend on the specific use case, the
> cluster size is definitely not a limiting factor.
>
> Dinesh
>
> On Aug 27, 2018, at 5:05 AM, kurt greaves  wrote:
>
> I believe there are caveats that it will only really help if you're not
> using vnodes, or you have a very small cluster, and also internode
> encryption is not enabled. Alternatively if you're using JBOD vnodes will
> be marginally better, but JBOD is not a great idea (and doesn't guarantee a
> massive improvement).
>
> On 27 August 2018 at 15:46, dinesh.jo...@yahoo.com.INVALID <
> dinesh.jo...@yahoo.com.invalid> wrote:
>
>> Yes, this feature will help with operating nodes with higher data density.
>>
>> Dinesh
>>
>>
>> On Saturday, August 25, 2018, 9:01:27 PM PDT, onmstester onmstester <
>> onmstes...@zoho.com> wrote:
>>
>>
>> I've noticed this new feature of 4.0:
>> Streaming optimizations (https://cassandra.apache.org/
>> blog/2018/08/07/faster_streaming_in_cassandra.html)
>> Does this mean that we could have much more data density with Cassandra 4.0
>> (fewer problems than 3.X)? I mean > 10 TB of data on each node without
>> worrying about node join/remove?
>> This is something needed for write-heavy applications that do not read a
>> lot. When you have around 2 TB of data per day and need to keep it for 6
>> months, it would be a waste of money to purchase 180 servers (even commodity
>> or cloud).
>> IMHO, even if 4.0 fixes the problem with streaming/joining a new node,
>> compaction is still another evil for a big node, but we could tolerate that
>> somehow
>>
>> Sent using Zoho Mail <https://www.zoho.com/mail/>
>>
>>
>>
>


Re: 2.2 eats memory

2018-08-27 Thread kurt greaves
I'm thinking it's unlikely that top is lying to you. Are you sure that
you're measuring free memory versus available memory? Cassandra will
utilise the OS page cache heavily, which will cache files in memory but
leave the memory able to be reclaimed if needed. Have you checked the
output of free? If available memory is still high you're perfectly fine and
everything is working as expected.
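[Editor's note] The free-vs-available distinction is visible in /proc/meminfo; a quick parser over sample (made-up) values:

```python
SAMPLE_MEMINFO = """\
MemTotal:       65831880 kB
MemFree:          842312 kB
MemAvailable:   48211504 kB
Cached:         45120340 kB
"""

def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields in kB."""
    info = {}
    for line in text.splitlines():
        key, value = line.split(":")
        info[key.strip()] = int(value.strip().split()[0])
    return info

mem = parse_meminfo(SAMPLE_MEMINFO)
# "Free" looks alarmingly low because the page cache (Cached) is full
# of SSTable data, but that memory is reclaimable on demand --
# MemAvailable is the number that actually matters.
assert mem["MemFree"] < mem["MemAvailable"] < mem["MemTotal"]
```

On a real node you would read `/proc/meminfo` (or the `available` column of `free`) instead of the sample string.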

On 27 August 2018 at 21:32, Matthias Pfau  wrote:

> Hi there,
> after upgrading from 2.1 to 2.2.13, Cassandra eats up all available memory
> within one week. The following is a diagram of the left available RAM of a
> single node over the course of a week:
> https://imgur.com/a/H9BDBxC 
>
> Nodes are bare metal, 12 cores with 64GB RAM each, swapping disabled and
> configured with row cache disabled and auto sized key cache. Jemalloc is
> not configured.
>
> This is the same for all nodes. There is no noticeable cpu load on these
> nodes, disk io is also fine, just as tpstats. Heap dumps look fine and if
> you check memory usage via JMX everything seems to be in order also (max
> heap 8GB, heap memory usage goes down to 1GB after GC)
>
> The memory usage reported by top (res) is always around 20GB for the
> cassandra process, which also seems too high for us (heap + off heap), as the
> offheap memory allocated by DirectByteBuffers is only around 400MB
> (reported by jxray).
>
> The available system memory slowly degrades all the time without any other
> memory intensive processes running. We currently drain and restart our
> nodes, if the available memory sinks below 5GB. That means that after a
> week, there is an offset of around 40GB between the reported memory usage
> of cassandra and the real memory usage.
>
> If we stop cassandra, nearly 100% of the system memory is available again.
>
> Has anybody observed similar memory usage with 2.2?
>
> If so, how to measure cassandras real memory usage if top and other system
> tools are way off?
>
> Do you know if this problem has been solved with 3.11?
>
> Cheers,
> Matthias
>
>
>


Re: bigger data density with Cassandra 4.0?

2018-08-27 Thread kurt greaves
I believe there are caveats that it will only really help if you're not
using vnodes, or you have a very small cluster, and also internode
encryption is not enabled. Alternatively if you're using JBOD vnodes will
be marginally better, but JBOD is not a great idea (and doesn't guarantee a
massive improvement).

On 27 August 2018 at 15:46, dinesh.jo...@yahoo.com.INVALID <
dinesh.jo...@yahoo.com.invalid> wrote:

> Yes, this feature will help with operating nodes with higher data density.
>
> Dinesh
>
>
> On Saturday, August 25, 2018, 9:01:27 PM PDT, onmstester onmstester <
> onmstes...@zoho.com> wrote:
>
>
> I've noticed this new feature of 4.0:
> Streaming optimizations (https://cassandra.apache.org/
> blog/2018/08/07/faster_streaming_in_cassandra.html)
> Does this mean that we could have much more data density with Cassandra 4.0
> (fewer problems than 3.X)? I mean > 10 TB of data on each node without
> worrying about node join/remove?
> This is something needed for write-heavy applications that do not read a
> lot. When you have around 2 TB of data per day and need to keep it for 6
> months, it would be a waste of money to purchase 180 servers (even commodity
> or cloud).
> IMHO, even if 4.0 fixes the problem with streaming/joining a new node,
> compaction is still another evil for a big node, but we could tolerate that
> somehow
>
> Sent using Zoho Mail 
>
>
>


Re: Configuration parameter to reject incremental repair?

2018-08-20 Thread kurt greaves
Yeah I meant 2.2. Keep telling myself it was 3.0 for some reason.

On 20 August 2018 at 19:29, Oleksandr Shulgin 
wrote:

> On Mon, Aug 13, 2018 at 1:31 PM kurt greaves  wrote:
>
>> No flag currently exists. Probably a good idea considering the serious
>> issues with incremental repairs since forever, and the change of defaults
>> since 3.0.
>>
>
> Hi Kurt,
>
> Did you mean since 2.2 (when incremental became the default one)?  Or was
> there more to it that I'm not aware of?
>
> Thanks,
> --
> Alex
>
>


Re: JBOD disk failure

2018-08-17 Thread kurt greaves
As far as I'm aware, yes. I recall hearing someone mention tying system
tables to a particular disk but at the moment that doesn't exist.

On Fri., 17 Aug. 2018, 01:04 Eric Evans,  wrote:

> On Wed, Aug 15, 2018 at 3:23 AM kurt greaves  wrote:
> > Yep. It might require a full node replace depending on what data is lost
> from the system tables. In some cases you might be able to recover from
> partially lost system info, but it's not a sure thing.
>
> Ugh, does it really just boil down to what part of `system` happens to
> be on the disk in question?  In my mind, that makes the only sane
> operational procedure for a failed disk to be: "replace the entire
> node".  IOW, I don't think we can realistically claim you can survive
> a failed a JBOD device if it relies on happenstance.
>
> > On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, <
> christian.lor...@webtrekk.com> wrote:
> >>
> >> Thank you for the answers. We are using the current version 3.11.3 So
> this one includes CASSANDRA-6696.
> >>
> >> So if I get this right, losing system tables will need a full node
> rebuild. Otherwise repair will get the node consistent again.
> >
> > [ ... ]
>
> --
> Eric Evans
> john.eric.ev...@gmail.com
>
>
>


Re: JBOD disk failure

2018-08-15 Thread kurt greaves
Yep. It might require a full node replace depending on what data is lost
from the system tables. In some cases you might be able to recover from
partially lost system info, but it's not a sure thing.

On Wed., 15 Aug. 2018, 17:55 Christian Lorenz, <
christian.lor...@webtrekk.com> wrote:

> Thank you for the answers. We are using the current version 3.11.3 So this
> one includes CASSANDRA-6696.
>
> So if I get this right, losing system tables will need a full node
> rebuild. Otherwise repair will get the node consistent again.
>
>
>
> Regards,
>
> Christian
>
>
>
>
>
> *From: *kurt greaves 
> *Reply to: *"user@cassandra.apache.org" 
> *Date: *Wednesday, 15 August 2018 at 04:53
> *To: *User 
> *Subject: *Re: JBOD disk failure
>
>
>
> If that disk had important data in the system tables however you might
> have some trouble and need to replace the entire instance anyway.
>
>
>
> On 15 August 2018 at 12:20, Jeff Jirsa  wrote:
>
> Depends on version
>
>
>
> For versions without the fix from Cassandra-6696, the only safe option on
> single disk failure is to stop and replace the whole instance - this is
> important because in older versions of Cassandra, you could have data in
> one sstable, a tombstone shadowing it in another disk, and it could be very
> far behind gc_grace_seconds. On disk failure in this scenario, if the disk
> holding the tombstone is lost, repair will propagate the
> (deleted/resurrected) data to the other replicas, which probably isn’t what
> you want to happen.
>
>
>
> With 6696, you should be safe to replace the disk and run repair - 6696
> will keep data for a given token range all on the same disks, so the
> resurrection problem is solved.
>
>
>
>
>
> --
>
> Jeff Jirsa
>
>
>
>
> On Aug 14, 2018, at 6:10 AM, Christian Lorenz <
> christian.lor...@webtrekk.com> wrote:
>
> Hi,
>
>
>
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting
> data, what happens if the nodes are setup with JBOD and one disk fails? Do
> I get consistent results while the broken drive is replaced and a nodetool
> repair is running on the node with the replaced drive?
>
>
>
> Kind regards,
>
> Christian
>
>
>


Re: JBOD disk failure

2018-08-14 Thread kurt greaves
If that disk had important data in the system tables however you might have
some trouble and need to replace the entire instance anyway.

On 15 August 2018 at 12:20, Jeff Jirsa  wrote:

> Depends on version
>
> For versions without the fix from Cassandra-6696, the only safe option on
> single disk failure is to stop and replace the whole instance - this is
> important because in older versions of Cassandra, you could have data in
> one sstable, a tombstone shadowing it in another disk, and it could be very
> far behind gc_grace_seconds. On disk failure in this scenario, if the disk
> holding the tombstone is lost, repair will propagate the
> (deleted/resurrected) data to the other replicas, which probably isn’t what
> you want to happen.
>
> With 6696, you should be safe to replace the disk and run repair - 6696
> will keep data for a given token range all on the same disks, so the
> resurrection problem is solved.
>
>
> --
> Jeff Jirsa
>
>
> On Aug 14, 2018, at 6:10 AM, Christian Lorenz <
> christian.lor...@webtrekk.com> wrote:
>
> Hi,
>
>
>
> given a cluster with RF=3 and CL=LOCAL_ONE and application is deleting
> data, what happens if the nodes are setup with JBOD and one disk fails? Do
> I get consistent results while the broken drive is replaced and a nodetool
> repair is running on the node with the replaced drive?
>
>
>
> Kind regards,
>
> Christian
>
>
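The 6696 behaviour Jeff describes can be sketched as a toy model (Python, with purely illustrative names and a stand-in hash, not Cassandra's actual implementation): pin each contiguous token range to one data directory, so a partition's data and any tombstones shadowing it always share a disk.

```python
import bisect
from hashlib import md5

# Murmur3 token range used by Cassandra; the hash below is only a stand-in.
MIN_TOKEN, MAX_TOKEN = -(2**63), 2**63 - 1

def disk_boundaries(num_disks):
    """Evenly split the token range into num_disks contiguous ranges."""
    span = (MAX_TOKEN - MIN_TOKEN) // num_disks
    return [MIN_TOKEN + span * i for i in range(1, num_disks)]

def disk_for_token(token, boundaries):
    """Pick the data directory index owning this token."""
    return bisect.bisect_right(boundaries, token)

def token_for_key(partition_key: bytes) -> int:
    # Illustrative hash only; real Cassandra uses Murmur3Partitioner.
    return int.from_bytes(md5(partition_key).digest()[:8], "big", signed=True)

boundaries = disk_boundaries(3)
for key in (b"user:1", b"user:2", b"user:3"):
    # The same key always maps to the same disk, so a tombstone and the
    # data it shadows can never end up on different drives.
    disk = disk_for_token(token_for_key(key), boundaries)
    assert 0 <= disk < 3
    assert disk == disk_for_token(token_for_key(key), boundaries)
```

Because the mapping is deterministic per token, losing one drive loses a whole token range (data plus its tombstones together), which is why replacing the disk and repairing becomes safe.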


Re: 90million reads

2018-08-14 Thread kurt greaves
Not a great idea to make config changes without testing. For a lot of
changes you can make the change on one node and measure whether there is an
improvement, however.

You'd probably be best to add nodes (double should be sufficient), do
tuning and testing afterwards, and then decommission a few nodes if you can.

On Wed., 15 Aug. 2018, 05:00 Abdul Patel,  wrote:

> Currently our cassandra prod is an 18 node 3 dc cluster and the application
> does 55 million reads per day, and we want to add load and make it 90 million
> reads per day. They need a guesstimate of resources which we need to bump
> without testing ..on top of my head we can increase heap and native transport
> value ..any other parameters I should be concerned about?
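For what it's worth, the raw averages behind the numbers in the quoted question work out as below (illustrative back-of-envelope only: these are daily averages, peak traffic is what actually matters, and real sizing needs the load testing kurt mentions).

```python
# Cluster-wide average read rates implied by the quoted daily volumes.
reads_per_day_now = 55_000_000
reads_per_day_target = 90_000_000
nodes = 18
seconds_per_day = 86_400

now_rps = reads_per_day_now / seconds_per_day        # ~637 reads/s cluster-wide
target_rps = reads_per_day_target / seconds_per_day  # ~1042 reads/s cluster-wide
per_node_rps = target_rps / nodes                    # ~58 reads/s per node, average

print(round(now_rps), round(target_rps), round(per_node_rps))
```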


Re: Data Corruption due to multiple Cassandra 2.1 processes?

2018-08-13 Thread kurt greaves
New ticket for backporting, referencing the existing.

On Mon., 13 Aug. 2018, 22:50 Steinmaurer, Thomas, <
thomas.steinmau...@dynatrace.com> wrote:

> Thanks Kurt.
>
>
>
> What is the proper workflow here to get this accepted? Create a new ticket
> dedicated for the backport referencing 11540 or re-open 11540?
>
>
>
> Thanks for your help.
>
>
>
> Thomas
>
>
>
> *From:* kurt greaves 
> *Sent:* Montag, 13. August 2018 13:24
> *To:* User 
> *Subject:* Re: Data Corruption due to multiple Cassandra 2.1 processes?
>
>
>
> Yeah that's not ideal and could lead to problems. I think corruption is
> only likely if compactions occur, but it seems data loss is a possibility,
> not to mention all sorts of other possible nasties that could occur running
> two C*'s at once. Seems to me that 11540 should have gone to 2.1 in the
> first place, but it just got missed. Very simple patch so I think a
> backport should be accepted.
>
>
>
> On 7 August 2018 at 15:57, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hello,
>
>
>
> with 2.1, in case a second Cassandra process/instance is started on a host
> (by accident), may this result in some sort of corruption, although
> Cassandra will exit at some point in time due to not being able to bind TCP
> ports already in use?
>
>
>
> What we have seen in this scenario is something like that:
>
>
>
> ERROR [main] 2018-08-05 21:10:24,046 CassandraDaemon.java:120 - Error
> starting local jmx server:
>
> java.rmi.server.ExportException: Port already in use: 7199; nested
> exception is:
>
> java.net.BindException: Address already in use (Bind
> failed)
>
> …
>
>
>
> But then continuing with stuff like opening system and even user tables:
>
>
>
> INFO  [main] 2018-08-05 21:10:24,060 CacheService.java:110 - Initializing
> key cache with capacity of 100 MBs.
>
> INFO  [main] 2018-08-05 21:10:24,067 CacheService.java:132 - Initializing
> row cache with capacity of 0 MBs
>
> INFO  [main] 2018-08-05 21:10:24,073 CacheService.java:149 - Initializing
> counter cache with capacity of 50 MBs
>
> INFO  [main] 2018-08-05 21:10:24,074 CacheService.java:160 - Scheduling
> counter cache save to every 7200 seconds (going to save all keys).
>
> INFO  [main] 2018-08-05 21:10:24,161 ColumnFamilyStore.java:365 -
> Initializing system.sstable_activity
>
> INFO  [SSTableBatchOpen:2] 2018-08-05 21:10:24,692 SSTableReader.java:475
> - Opening
> /var/opt/xxx-managed/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-165
> (2023 bytes)
>
> INFO  [SSTableBatchOpen:3] 2018-08-05 21:10:24,692 SSTableReader.java:475
> - Opening
> /var/opt/xxx-managed/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-167
> (2336 bytes)
>
> INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,692 SSTableReader.java:475
> - Opening
> /var/opt/xxx-managed/cassandra/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-166
> (2686 bytes)
>
> INFO  [main] 2018-08-05 21:10:24,755 ColumnFamilyStore.java:365 -
> Initializing system.hints
>
> INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,758 SSTableReader.java:475
> - Opening
> /var/opt/xxx-managed/cassandra/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-377
> (46210621 bytes)
>
> INFO  [main] 2018-08-05 21:10:24,766 ColumnFamilyStore.java:365 -
> Initializing system.compaction_history
>
> INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,768 SSTableReader.java:475
> - Opening
> /var/opt/xxx-managed/cassandra/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/system-compaction_history-ka-129
> (91269 bytes)
>
> …
>
>
>
> Replaying commit logs:
>
>
>
> …
>
> INFO  [main] 2018-08-05 21:10:25,896 CommitLogReplayer.java:267 -
> Replaying
> /var/opt/dynatrace-managed/cassandra/commitlog/CommitLog-4-1533133668366.log
>
> INFO  [main] 2018-08-05 21:10:25,896 CommitLogReplayer.java:270 -
> Replaying
> /var/opt/dynatrace-managed/cassandra/commitlog/CommitLog-4-1533133668366.log
> (CL version 4, messaging version 8)
>
> …
>
>
>
> Even writing memtables already (below just pasted system tables, but also
> user tables):
>
>
>
> …
>
> INFO  [MemtableFlushWriter:4] 2018-08-05 21:11:52,524 Memtable.java:347 -
> Writing Memtable-size_estimates@1941663179(2.655MiB serialized bytes,
> 325710 ops, 2%/0% of on/off-heap limit)
>
> INFO  [MemtableFlushWriter:3] 2018-08-05 21:11:52,552 Memtable.java:347 -
> Writing Memtable-peer_events@1474667699(0.199K

Re: Configuration parameter to reject incremental repair?

2018-08-13 Thread kurt greaves
No flag currently exists. Probably a good idea considering the serious
issues with incremental repairs since forever, and the change of defaults
since 3.0.
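In the absence of a built-in flag, one hypothetical stop-gap is a wrapper placed ahead of nodetool on PATH that refuses `repair` invocations that don't explicitly pass `-full` (i.e. that would run incremental repair on 3.x). A minimal sketch of the argument check:

```python
import sys

def check_repair_args(argv):
    """Reject nodetool repair invocations that would default to incremental
    repair on 3.x. Hypothetical guard, not a Cassandra feature."""
    if "repair" in argv and "-full" not in argv and "--full" not in argv:
        raise SystemExit("incremental repair disabled: re-run with -full")
    return argv

# A full repair passes through; a bare repair is rejected.
assert check_repair_args(["nodetool", "repair", "-full", "mykeyspace"])
try:
    check_repair_args(["nodetool", "repair"])
except SystemExit:
    pass
```

A wrapper like this only helps operators who go through the wrapper, of course; it is no substitute for a server-side option.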

On 7 August 2018 at 16:44, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hello,
>
>
>
> we are running Cassandra in AWS and On-Premise at customer sites,
> currently 2.1 in production with 3.11 in loadtest.
>
>
>
> In a migration path from 2.1 to 3.11.x, I’m afraid that at some point in
> time we end up in incremental repairs being enabled / ran a first time
> unintentionally, cause:
>
> a) A lot of online resources / examples do not use the -full command-line
> option
>
> b) Our internal (support) tickets of course also state nodetool repair
> command without the -full option, as these are for 2.1
>
>
>
> Especially for On-Premise customers (with less control than with our AWS
> deployments), this asks a bit for getting out-of-control once we have 3.11
> out and nodetool repair being run without the -full command-line option.
>
>
>
> So, what do you think about a JVM system property, cassandra.yaml … to
> basically let the operator chose if incremental repairs are allowed or not?
> I know, such a flag still can be flipped then (by the customer), but as a
> first safety stage possibly sufficient enough.
>
>
>
> Or perhaps something like that is already available (vaguely remember
> something like that for MV).
>
>
>
> Thanks a lot,
>
> Thomas
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
>


Re: Data Corruption due to multiple Cassandra 2.1 processes?

2018-08-13 Thread kurt greaves
Yeah that's not ideal and could lead to problems. I think corruption is
only likely if compactions occur, but it seems data loss is a possibility,
not to mention all sorts of other possible nasties that could occur running
two C*'s at once. Seems to me that 11540 should have gone to 2.1 in the
first place, but it just got missed. Very simple patch so I think a
backport should be accepted.
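The fail-fast behaviour the 11540 backport gives can be sketched roughly as follows (assumed semantics, illustrative code only): treat a failure to bind a required port, such as the JMX port, as fatal at startup, instead of logging the error and carrying on into commit log replay and memtable flushes as the 2.1 logs above show.

```python
import socket
import sys

def bind_or_die(port):
    """Bind the port or exit immediately, before touching any on-disk state."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("127.0.0.1", port))
    except OSError:
        sys.exit(f"Port already in use: {port}; refusing to start")
    return s  # keep open for the lifetime of the process
```

The point is ordering: the bind check happens before any data is mutated, so an accidental second process never gets a two-minute window to mangle SSTables and commit logs.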

On 7 August 2018 at 15:57, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hello,
>
>
>
> with 2.1, in case a second Cassandra process/instance is started on a host
> (by accident), may this result in some sort of corruption, although
> Cassandra will exit at some point in time due to not being able to bind TCP
> ports already in use?
>
>
>
> What we have seen in this scenario is something like that:
>
>
>
> ERROR [main] 2018-08-05 21:10:24,046 CassandraDaemon.java:120 - Error
> starting local jmx server:
>
> java.rmi.server.ExportException: Port already in use: 7199; nested
> exception is:
>
> java.net.BindException: Address already in use (Bind
> failed)
>
> …
>
>
>
> But then continuing with stuff like opening system and even user tables:
>
>
>
> INFO  [main] 2018-08-05 21:10:24,060 CacheService.java:110 - Initializing
> key cache with capacity of 100 MBs.
>
> INFO  [main] 2018-08-05 21:10:24,067 CacheService.java:132 - Initializing
> row cache with capacity of 0 MBs
>
> INFO  [main] 2018-08-05 21:10:24,073 CacheService.java:149 - Initializing
> counter cache with capacity of 50 MBs
>
> INFO  [main] 2018-08-05 21:10:24,074 CacheService.java:160 - Scheduling
> counter cache save to every 7200 seconds (going to save all keys).
>
> INFO  [main] 2018-08-05 21:10:24,161 ColumnFamilyStore.java:365 -
> Initializing system.sstable_activity
>
> INFO  [SSTableBatchOpen:2] 2018-08-05 21:10:24,692 SSTableReader.java:475
> - Opening /var/opt/xxx-managed/cassandra/system/sstable_activity-
> 5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-165 (2023
> bytes)
>
> INFO  [SSTableBatchOpen:3] 2018-08-05 21:10:24,692 SSTableReader.java:475
> - Opening /var/opt/xxx-managed/cassandra/system/sstable_activity-
> 5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-167 (2336
> bytes)
>
> INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,692 SSTableReader.java:475
> - Opening /var/opt/xxx-managed/cassandra/system/sstable_activity-
> 5a1ff267ace03f128563cfae6103c65e/system-sstable_activity-ka-166 (2686
> bytes)
>
> INFO  [main] 2018-08-05 21:10:24,755 ColumnFamilyStore.java:365 -
> Initializing system.hints
>
> INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,758 SSTableReader.java:475
> - Opening /var/opt/xxx-managed/cassandra/system/hints-
> 2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-377 (46210621 bytes)
>
> INFO  [main] 2018-08-05 21:10:24,766 ColumnFamilyStore.java:365 -
> Initializing system.compaction_history
>
> INFO  [SSTableBatchOpen:1] 2018-08-05 21:10:24,768 SSTableReader.java:475
> - Opening /var/opt/xxx-managed/cassandra/system/compaction_history-
> b4dbb7b4dc493fb5b3bfce6e434832ca/system-compaction_history-ka-129 (91269
> bytes)
>
> …
>
>
>
> Replaying commit logs:
>
>
>
> …
>
> INFO  [main] 2018-08-05 21:10:25,896 CommitLogReplayer.java:267 -
> Replaying /var/opt/dynatrace-managed/cassandra/commitlog/CommitLog-
> 4-1533133668366.log
>
> INFO  [main] 2018-08-05 21:10:25,896 CommitLogReplayer.java:270 -
> Replaying 
> /var/opt/dynatrace-managed/cassandra/commitlog/CommitLog-4-1533133668366.log
> (CL version 4, messaging version 8)
>
> …
>
>
>
> Even writing memtables already (below just pasted system tables, but also
> user tables):
>
>
>
> …
>
> INFO  [MemtableFlushWriter:4] 2018-08-05 21:11:52,524 Memtable.java:347 -
> Writing Memtable-size_estimates@1941663179(2.655MiB serialized bytes,
> 325710 ops, 2%/0% of on/off-heap limit)
>
> INFO  [MemtableFlushWriter:3] 2018-08-05 21:11:52,552 Memtable.java:347 -
> Writing Memtable-peer_events@1474667699(0.199KiB serialized bytes, 4 ops,
> 0%/0% of on/off-heap limit)
>
> …
>
>
>
> Until it comes to a point where it can’t bind ports like the storage port
> 7000:
>
>
>
> ERROR [main] 2018-08-05 21:11:54,350 CassandraDaemon.java:395 - Fatal
> configuration error
>
> org.apache.cassandra.exceptions.ConfigurationException: /XXX:7000 is in
> use by another process.  Change listen_address:storage_port in
> cassandra.yaml to values that do not conflict with other services
>
> at org.apache.cassandra.net.MessagingService.
> getServerSockets(MessagingService.java:495) ~[apache-cassandra-2.1.18.jar:
> 2.1.18]
>
> …
>
>
>
> Until Cassandra stops:
>
>
>
> …
>
> INFO  [StorageServiceShutdownHook] 2018-08-05 21:11:54,361
> Gossiper.java:1454 - Announcing shutdown
>
> …
>
>
>
>
>
> So, we have around 2 minutes where Cassandra is mangling with existing
> data, although it shouldn’t.
>
>
>
> Sounds like a potential candidate for data corruption, right? E.g. later
> on we then see things like (still while being in progress to shutdown?):

Re: Hinted Handoff

2018-08-06 Thread kurt greaves
>
> Does Cassandra TTL out the hints after max_hint_window_in_ms? From my
> understanding, Cassandra only stops collecting hints after
> max_hint_window_in_ms but can still keep replaying the hints if the node
> comes back again. Is this correct? Is there a way to TTL out hints?


No, but it won't send hints that have passed the HH window. Also, this
shouldn't be caused by HH, as the hints maintain the original timestamp with
which they were written.

Honestly, this sounds more like a use case for a distributed cache than
Cassandra. Keeping data for 30 minutes and then deleting it is going
to be a nightmare to manage in Cassandra.
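A toy model of the two hint properties above (assumed semantics, not Cassandra source): (1) a coordinator stops collecting hints for a node once it has been down longer than max_hint_window_in_ms, and (2) a replayed hint carries the original write timestamp, so cell-level last-write-wins reconciliation still applies on delivery and a hinted write cannot override a later delete.

```python
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # cassandra.yaml default: 3 hours

def should_store_hint(node_down_since_ms, now_ms):
    """Hints are only collected while the node's downtime is within the window."""
    return now_ms - node_down_since_ms <= MAX_HINT_WINDOW_MS

def reconcile(existing, hint):
    """Cell-level last-write-wins over (value, write_timestamp) tuples."""
    return existing if existing[1] >= hint[1] else hint

# A delete issued at t=200 still wins over a hinted write from t=100:
assert reconcile(("tombstone", 200), ("old-value", 100)) == ("tombstone", 200)
assert should_store_hint(0, MAX_HINT_WINDOW_MS) is True
assert should_store_hint(0, MAX_HINT_WINDOW_MS + 1) is False
```

This is why resurrection via hints needs the hint's timestamp to beat the tombstone's, which hint replay alone doesn't cause.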

On 7 August 2018 at 07:20, Agrawal, Pratik 
wrote:

> Does Cassandra TTL out the hints after max_hint_window_in_ms? From my
> understanding, Cassandra only stops collecting hints after
> max_hint_window_in_ms but can still keep replaying the hints if the node
> comes back again. Is this correct? Is there a way to TTL out hints?
>
>
>
> Thanks,
>
> Pratik
>
>
>
> *From: *Kyrylo Lebediev 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, August 6, 2018 at 4:10 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Hinted Handoff
>
>
>
> Small gc_grace_seconds value lowers max allowed node downtime, which is 15
> minutes in your case. After 15 minutes of downtime you'll need to replace
> the node, as you described. This interval looks too short to be able to do
> planned maintenance. So, in case you set larger value for gc_grace_seconds
> (lets say, hours or a day) will you get visible read amplification / waste
> a lot of disk space / issues with compactions?
>
>
>
> Hinted handoff may be the reason in case hinted handoff window is longer
> than gc_grace_seconds. To me it looks like hinted handoff window
> (max_hint_window_in_ms in cassandra.yaml, which defaults to 3h) must always
> be set to a value less than gc_grace_seconds.
>
>
>
> Regards,
>
> Kyrill
> --
>
> *From:* Agrawal, Pratik 
> *Sent:* Monday, August 6, 2018 8:22:27 PM
> *To:* user@cassandra.apache.org
> *Subject:* Hinted Handoff
>
>
>
> Hello all,
>
> We use Cassandra in non-conventional way, where our data is short termed
> (life cycle of about 20-30 minutes) where each record is updated ~5 times
> and then deleted. We have GC grace of 15 minutes.
>
> We are seeing 2 problems
>
> 1.) A certain number of Cassandra nodes goes down and then we remove it
> from the cluster using Cassandra removenode command and replace the dead
> nodes with new nodes. While new nodes are joining in, we see more nodes
> down (which are not actually down) but we see following errors in the log
>
> “Gossip not settled after 321 polls. Gossip Stage
> active/pending/completed: 1/816/0”
>
>
>
> To fix the issue, I restarted the server and the nodes now appear to be up
> and the problem is solved
>
>
>
> Can this problem be related to
> https://issues.apache.org/jira/browse/CASSANDRA-6590 ?
>
>
>
> 2.) Meanwhile, after restarting the nodes mentioned above, we see that
> some old deleted data is resurrected (because of short lifecycle of our
> data). My guess at the moment is that these data is resurrected due to
> hinted handoff. Interesting point to note here is that data keeps
> resurrecting at periodic intervals (like an hour) and then finally stops.
> Could this be caused by hinted handoff? if so is there any setting which we
> can set to specify that “invalidate, hinted handoff data after 5-10
> minutes”.
>
>
>
> Thanks,
> Pratik
>


Re: 3.11.2 memory leak

2018-07-22 Thread kurt greaves
Likely in the next few weeks.

On Mon., 23 Jul. 2018, 01:17 Abdul Patel,  wrote:

> Any idea when 3.11.3 is coming in?
>
> On Tuesday, June 19, 2018, kurt greaves  wrote:
>
>> At this point I'd wait for 3.11.3. If you can't, you can get away with
>> backporting a few repair fixes or just doing sub range repairs on 3.11.2
>>
>> On Wed., 20 Jun. 2018, 01:10 Abdul Patel,  wrote:
>>
>>> Hi All,
>>>
>>> Do we know what's the stable version for now if we wish to upgrade?
>>>
>>> On Tuesday, June 5, 2018, Steinmaurer, Thomas <
>>> thomas.steinmau...@dynatrace.com> wrote:
>>>
>>>> Jeff,
>>>>
>>>>
>>>>
>>>> FWIW, when talking about
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13929, there is a
>>>> patch available since March without getting further attention.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
>>>> *Sent:* Dienstag, 05. Juni 2018 00:51
>>>> *To:* cassandra 
>>>> *Subject:* Re: 3.11.2 memory leak
>>>>
>>>>
>>>>
>>>> There have been a few people who have reported it, but nobody (yet) has
>>>> offered a patch to fix it. It would be good to have a reliable way to
>>>> repro, and/or an analysis of a heap dump demonstrating the problem (what's
>>>> actually retained at the time you're OOM'ing).
>>>>
>>>>
>>>>
>>>> On Mon, Jun 4, 2018 at 6:52 AM, Abdul Patel 
>>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>>
>>>>
>>>> I recently upgraded my non prod cluster from 3.10 to 3.11.2.
>>>>
>>>> It was working fine for about 1.5 weeks, then suddenly nodetool info started
>>>> reporting 80% and more memory consumption.
>>>>
>>>> Intially it was 16gb configured, then i bumped to 20gb and rebooted all
>>>> 4 nodes of cluster-single DC.
>>>>
>>>> Now after 8 days i again see 80% + usage and its 16gb and above ..which
>>>> we never saw before .
>>>>
>>>> Seems like memory leak bug?
>>>>
>>>> Does anyone has any idea ? Our 3.11.2 release rollout has been halted
>>>> because of this.
>>>>
>>>> If not 3.11.2 whats the next best stable release we have now?
>>>>
>>>>
>>>> The contents of this e-mail are intended for the named addressee only.
>>>> It contains information that may be confidential. Unless you are the named
>>>> addressee or an authorized designee, you may not copy or use it, or
>>>> disclose it to anyone else. If you received it in error please notify us
>>>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>>>> number FN 91482h) is a company registered in Linz whose registered office
>>>> is at 4040 Linz, Austria, Freistädterstraße 313
>>>>
>>>


Re: Limitations of Hinted Handoff OverloadedException exception

2018-07-16 Thread kurt greaves
The coordinator will refuse to send writes/hints to a node if it already has
a large backlog of hints (more than 128 * #cores) and the destination replica
is one of the nodes with hints destined for it.
It will still send writes to any "healthy" node (a node with no outstanding
hints).

The idea is to not further overload already overloaded nodes. If you see
OverloadedExceptions you'll have to repair after all nodes become stable.

See StorageProxy.java#L1327, called from StorageProxy.java::sendToHintedEndpoints()
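A rough sketch of that back-pressure rule (illustrative names only; the real logic lives in StorageProxy): the coordinator throws OverloadedException when its total in-flight hint count exceeds 128 * cores and the destination replica already has hints queued for it, while writes to healthy replicas continue.

```python
import os

# 128 hints per core, mirroring the limit kurt describes above.
MAX_HINTS_IN_PROGRESS = 128 * (os.cpu_count() or 1)

class OverloadedException(Exception):
    pass

def check_hint_overload(total_hints_in_progress, hints_for_destination):
    """Refuse the write only when globally overloaded AND the target replica
    is itself one of the nodes with outstanding hints."""
    if (total_hints_in_progress > MAX_HINTS_IN_PROGRESS
            and hints_for_destination > 0):
        raise OverloadedException("Too many in flight hints")
```

So the exception depends on cores and outstanding hints, not directly on disk size, which answers the question about the limitation's dependency.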


On 13 July 2018 at 05:38, Karthick V  wrote:

> Refs : https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/
> opsRepairNodesHintedHandoff.html
>
> On Thu, Jul 12, 2018 at 7:46 PM Karthick V  wrote:
>
>> Hi everyone,
>>
>>  If several nodes experience brief outages simultaneously, substantial
>>> memory pressure can build up on the coordinator.* The coordinator
>>> tracks how many hints it is currently writing, and if the number increases
>>> too much, the coordinator refuses writes and throws the *
>>> *OverloadedException exception.*
>>
>>
>>  In the above statement, it is been said that after some extent(of
>> hints) the* coordinator *will refuse to writes. can someone explain the
>> depth of this limitations and its dependency if any (like disk size or any)?
>>
>> Regards
>> Karthick V
>>
>>
>>


Re: batchstatement

2018-07-16 Thread kurt greaves
What is the primary key for the user_by_ext table? I'd assume it's ext_id,
which would imply your update doesn't make sense as you can't change the
primary key for a row - which would be the problem you're seeing.
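A toy illustration of why a primary-key column can't be updated in place: the key is the row's address, so "changing the key" is really a delete of the old row plus an insert under the new key, which is what the application has to issue explicitly in CQL as well (the table names here are taken from the email; the dict model is obviously a simplification).

```python
# user_by_ext keyed by ext_id, as in the schema quoted below.
user_by_ext = {"old-ext-id": {"u_id": "u1", "act_id": "a1", "first_name": "Randy"}}

def change_ext_id(table, old_key, new_key):
    """Equivalent of a CQL DELETE on the old key followed by an INSERT
    under the new key; an UPDATE cannot do this in one statement."""
    row = table.pop(old_key)   # delete the old row...
    table[new_key] = row       # ...and re-insert it under the new key
    return table

change_ext_id(user_by_ext, "old-ext-id", "new-ext-id")
assert "old-ext-id" not in user_by_ext
assert user_by_ext["new-ext-id"]["first_name"] == "Randy"
```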

On Sat., 14 Jul. 2018, 06:14 Randy Lynn,  wrote:

> TL/DR:
> - only 1 out of 14 statements in a batch do not mutate the partition..
> - no error is logged in the application layer, or Cassandra system.log or
> Cassandra debug.log
> - using C#, and Datastax latest driver
> - cluster is a 1-node, dev setup.
> - datastax driver configured with LOCAL_QUORUM at the session, and
> statement level.
> - using preparedstatements.. 1,000% sure there's no typo.. (but I've been
> wrong before)
>
>
> I have about 14 statements that get batched up together. They're updating
> at most 2, maybe 3 denormalized tables.. all the same user object, just
> different lookup keys.
>
> To help visualize, the tables look a little like these.. abbreviated..
> User table..
> CREATE TABLE user (u_id uuid, act_id uuid, ext_id text, dt_created
> timeuuid, dt_mod timeuuid, is_group Boolean, first_name text)
>
> Users By Account (or plan)
> CREATE TABLE user_by_act (act_id uuid, u_id uuid, first_name text)
>
> User By external identifier
> CREATE TABLE user_by_ext (ext_id text, u_id uuid, act_id uuid, first_name
> text)
>
> I create a batch that updates all the tables.. various updates are broken
> out into separate statements, so for example, there's a statement that
> updates the external ID in the 'user' table.
>
> UPDATE user_by_ext SET ext_id = :ext_id WHERE u_id = :u_id
>
> This particular batch has 14 statements total, across all 3 tables. They
> are only updating at most 3 partitions.. a single partition may have 4 or
> more statements to update various parts of the partition. e.g. first name
> and last name are a single statement added to the batch.
>
> Here's the problem... of those 14 statements.. across the 3 partitions...
> ONE and ONLY ONE update doesn't work.. Absolutely every other discreet
> update in the whole batch works.
>
> List<BoundStatement> boundStatements = new
> List<BoundStatement>();
>
> // *
> // user table
>
> boundStatements.Add(SessionManager.UserInsertStatement.Bind(new { u_id=
> user.UserId, act_id = user.ActId, dt_created = nowId, dt_mod = nowId,
> is_everyone = user.IsEveryone, is_group = user.IsGroup }));
>
> if (!string.IsNullOrWhiteSpace(user.ExtId))
> //
> // this statement gets added to the list.. it is part of the batch
> // but it NEVER updates the actual field in the databse.
> // I have moved it around.. up, down... the only thing that works
> // is if I call execute on the first binding above, and then add the rest
> // of these as a separate batch.
>
> boundStatements.Add(SessionManager.UserUpdateExtIdStatement.Bind(new { u_id
> = user.UserId, ext_id = user.ExtId, dt_mod = nowId }));
> //
>
>
> if (!string.IsNullOrWhiteSpace(user.Email))
>
> boundStatements.Add(SessionManager.UserUpdateEmailStatement.Bind(new { u_id
> = user.UserId, email = user.Email, dt_mod = nowId }));
> BoundStatement userProfile =
> CreateUserProfileBoundStatement(nowId, user);
> if (userProfile != null)
> boundStatements.Add(userProfile);
> // *
> // user_by_act table
> CreateUserAccountInsertBoundStatements(boundStatements, user,
> nowId);
> // *
> // user_by_ext table
> if (!string.IsNullOrWhiteSpace(user.ExtId))
> {
>
> boundStatements.Add(SessionManager.UserExtInsertStatement.Bind(new { ext_id
> = user.ExtId, act_id = user.ActId, dt_created = nowId, dt_mod = nowId,
> is_group = user.IsGroup, u_id = user.UserId }));
> BoundStatement userByExtProfile =
> CreateUserByExtProfileBoundStatement(nowId, user);
> if (userByExtProfile != null)
> boundStatements.Add(userByExtProfile);
> if (!string.IsNullOrWhiteSpace(user.Email))
>
> boundStatements.Add(SessionManager.UserExtUpdateEmailStatement.Bind(new {
> ext_id = user.ExtId, email = user.Email, dt_mod = nowId }));
> }
>
>
>
> --
> Randy Lynn
> rl...@getavail.com
>
> office:
> 859.963.1616 <+1-859-963-1616> ext 202
> 163 East Main Street - Lexington, KY 40507 - USA
>
>  getavail.com 
>


Re: default_time_to_live vs TTL on insert statement

2018-07-11 Thread kurt greaves
The Datastax documentation is wrong. It won't error, and it shouldn't. If
you want to fix that documentation I suggest contacting Datastax.
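A minimal model of the TTL resolution behaviour being described (assumed semantics): a per-statement USING TTL always overrides the table's default_time_to_live, and no comparison between the two is performed, so a statement TTL larger than the table default is accepted without error, exactly as the test in the quoted email shows.

```python
def effective_ttl(statement_ttl, table_default_ttl):
    """Statement TTL wins when present; otherwise the table default applies.
    No validation of one against the other."""
    return statement_ttl if statement_ttl is not None else table_default_ttl

# Table default 240s, statement asks for 360s: accepted, 360 wins.
assert effective_ttl(360, 240) == 360
# No statement TTL: the table default applies.
assert effective_ttl(None, 240) == 240
```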

On 11 July 2018 at 19:56, Nitan Kainth  wrote:

> Hi DuyHai,
>
> Could you please explain in what case C* will error based on documented
> statement:
>
> You can set a default TTL for an entire table by setting the table's
> default_time_to_live
> 
>  property. If you try to set a TTL for a specific column that is longer
> than the time defined by the table TTL, Cassandra returns an error.
>
>
>
> On Wed, Jul 11, 2018 at 2:34 PM, DuyHai Doan  wrote:
>
>> default_time_to_live
>> 
>>  property applies if you don't specify any TTL on your CQL statement
>>
>> However you can always override the default_time_to_live
>> 
>>  property by specifying a custom value for each CQL statement
>>
>> The behavior is correct, nothing wrong here
>>
>> On Wed, Jul 11, 2018 at 7:31 PM, Nitan Kainth 
>> wrote:
>>
>>> Hi,
>>>
>>> As per document: https://docs.datastax.com/en/cql/3.3/cql/cql_using
>>> /useExpireExample.html
>>>
>>>
>>>-
>>>
>>>You can set a default TTL for an entire table by setting the table's
>>>default_time_to_live
>>>
>>> 
>>> property. If you try to set a TTL for a specific column that is
>>>longer than the time defined by the table TTL, Cassandra returns an 
>>> error.
>>>
>>>
>>> When I tried to test this statement, i found, we can insert data with
>>> TTL greater than default_time_to_live. Is the document needs correction, or
>>> am I mis-understanding it?
>>>
>>> CREATE TABLE test (
>>>
>>> name text PRIMARY KEY,
>>>
>>> description text
>>>
>>> ) WITH bloom_filter_fp_chance = 0.01
>>>
>>> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>
>>> AND comment = ''
>>>
>>> AND compaction = {'class': 'org.apache.cassandra.db.compa
>>> ction.SizeTieredCompactionStrategy', 'max_threshold': '32',
>>> 'min_threshold': '4'}
>>>
>>> AND compression = {'chunk_length_in_kb': '64', 'class': '
>>> org.apache.cassandra.io.compress.LZ4Compressor'}
>>>
>>> AND crc_check_chance = 1.0
>>>
>>> AND dclocal_read_repair_chance = 0.1
>>>
>>> AND default_time_to_live = 240
>>>
>>> AND gc_grace_seconds = 864000
>>>
>>> AND max_index_interval = 2048
>>>
>>> AND memtable_flush_period_in_ms = 0
>>>
>>> AND min_index_interval = 128
>>>
>>> AND read_repair_chance = 0.0
>>>
>>> AND speculative_retry = '99PERCENTILE';
>>>
>>> insert into test (name, description) values ('name5', 'name
>>> description5') using ttl 360;
>>>
>>> select * from test ;
>>>
>>>
>>>  name  | description
>>>
>>> ---+---
>>>
>>>  name5 | name description5
>>>
>>>
>>> SELECT TTL (description) from test;
>>>
>>>
>>>  ttl(description)
>>>
>>> --
>>>
>>>  351
>>>
>>> Can someone please clear this for me?
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


[ANNOUNCE] LDAP Authenticator for Cassandra

2018-07-05 Thread kurt greaves
We've seen a need for an LDAP authentication implementation for Apache
Cassandra so we've gone ahead and created an open source implementation
(ALv2) utilising the pluggable auth support in C*.

Now, I'm positive there are multiple implementations floating around that
haven't been open sourced, and that's understandable given how much of a
nightmare working with LDAP is, so we've come up with an implementation
that will hopefully work for the general case, but should be perfectly
possible to extend, or at least use an example to create your own and maybe
contribute something back ;). It's by no means perfect, but it seems to
work, and we're hoping people with actual LDAP environments can test and
add support/improvements for more weird LDAP based use cases.

You can find the code and setup + configuration instructions on GitHub, along
with a blog post that goes into more detail.

PS: Don't look too closely at the nasty cache hackery in the 3.11 branch,
I'll fix it in 4.0, I promise. Just be satisfied that it works, I think.


Re: Inconsistent Quorum Read after Quorum Write

2018-07-03 Thread kurt greaves
Shouldn't happen. Any chance you could trace the queries, or have you been
able to reproduce it? Also, what version of Cassandra?
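One way such a result could arise (a toy model of cell-level reconciliation at CL=QUORUM; assumed semantics, not Cassandra source): the coordinator merges the two replica responses per column by write timestamp, so if both replicas that answered are missing the older cells a and b, e.g. due to mid-repair divergence, the merged row comes back with nulls even though a third, un-consulted replica still holds them.

```python
def merge_rows(r1, r2):
    """Merge two replica responses column-by-column, newest timestamp wins.
    Cells are (value, write_timestamp) tuples; absent cells stay absent."""
    out = {}
    for col in set(r1) | set(r2):
        cells = [c for c in (r1.get(col), r2.get(col)) if c is not None]
        out[col] = max(cells, key=lambda c: c[1])
    return out

# Both responding replicas have the new c but lack the older a and b:
replica_1 = {"pk": (1, 0), "c": (3, 300)}
replica_2 = {"pk": (1, 0), "c": (3, 300)}
merged = merge_rows(replica_1, replica_2)
assert merged.get("a") is None and merged.get("b") is None
assert merged["c"] == (3, 300)
```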

On Wed., 4 Jul. 2018, 06:41 Visa,  wrote:

> Hi all,
>
> We recently experienced an unexpected behavior with C* consistency.
>
> For example, a table t consists of 4 columns - pk , a, b and c. We perform
> Quorum write and then Quorum read (RF=3 / LCS compaction).
>
> The consistency seems to break while repairing is running(repair -pr).
>
> Say, a record already exists in t like
> pk=1, a=1, b=1, c=1
>
> While repair is not running
>
> Quorum Write:
> update t set c = 2 where pk=1
>
> Quorum Read:
> select pk,a,b,c from t where pk=1 limit 1
>
> Returns: (1, 1, 1, 2) as expected.
>
> But if we do it while repair is running,
>
> Quorum Write:
> update t set c=3 where pk=1
>
> Quorum Read, however, returns (1, null, null, 3) w/o values of a and b.
>
> After repair is done, then the same Quorum Read returns the right values
> (1,1,1,3).
>
> It does not happen to every row in t. The impacted rows are like 40 out of
> 300 million. But still, how does the consistency get broken here?
>
> Thanks for your attention!
>
> Li
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: C* in multiple AWS AZ's

2018-06-29 Thread kurt greaves
Yes. You would just end up with a rack named differently to the AZ. This is
not a problem as racks are just logical. I would recommend migrating all
your DCs to GPFS though for consistency.
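With GossipingPropertyFileSnitch each node declares its own dc/rack in cassandra-rackdc.properties, so the node can keep the old rack label (e.g. "1c") while physically sitting in us-east-1e; racks are purely logical. A sketch of what that file would contain (the dc/rack values are illustrative for this thread's setup):

```python
import pathlib
import tempfile

# Contents a node moving to 1e might use to keep its logical rack name "1c".
rackdc = "dc=us-east\nrack=1c\n"

path = pathlib.Path(tempfile.mkdtemp()) / "cassandra-rackdc.properties"
path.write_text(rackdc)

# Simple key=value format, one pair per line.
props = dict(line.split("=", 1) for line in path.read_text().splitlines())
assert props == {"dc": "us-east", "rack": "1c"}
```

Since replica placement only cares about the logical rack, keeping the label stable avoids any token/replica movement when the hardware changes AZ.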

On Fri., 29 Jun. 2018, 09:04 Randy Lynn,  wrote:

> So we have two data centers already running..
>
> AP-SYDNEY, and US-EAST.. I'm using Ec2Snitch over a site-to-site tunnel..
> I'm wanting to move the current US-EAST from AZ 1a to 1e..
> I know all docs say use ec2multiregion for multi-DC.
>
> I like the GPFS idea. Would that work with multi-DC too?
> What's the downside? Status would report a rack of 1a, even though in 1e?
>
> Thanks in advance for the help/thoughts!!
>
>
> On Thu, Jun 28, 2018 at 6:20 PM, kurt greaves 
> wrote:
>
>> There is a need for a repair with both DCs as rebuild will not stream all
>> replicas, so unless you can guarantee you were perfectly consistent at time
>> of rebuild you'll want to do a repair after rebuild.
>>
>> On another note you could just replace the nodes but use GPFS instead of
>> EC2 snitch, using the same rack name.
>>
>> On Fri., 29 Jun. 2018, 00:19 Rahul Singh, 
>> wrote:
>>
>>> Parallel load is the best approach and then switch your Data access code
>>> to only access the new hardware. After you verify that there are no local
>>> read / writes on the OLD dc and that the updates are only via Gossip, then
>>> go ahead and change the replication factor on the key space to have zero
>>> replicas in the old DC. Then you can decommissioned.
>>>
>>> This way you are hundred percent sure that you aren’t missing any new
>>> data. No need for a DC to DC repair but a repair is always healthy.
>>>
>>> Rahul
>>> On Jun 28, 2018, 9:15 AM -0500, Randy Lynn , wrote:
>>>
>>> Already running with Ec2.
>>>
>>> My original thought was a new DC parallel to the current, and then
>>> decommission the other DC.
>>>
>>> Also my data load is small right now.. I know small is relative term..
>>> each node is carrying about 6GB..
>>>
>>> So given the data size, would you go with parallel DC or let the new AZ
>>> carry a heavy load until the others are migrated over?
>>> and then I think "repair" to cleanup the replications?
>>>
>>>
>>> On Thu, Jun 28, 2018 at 10:09 AM, Rahul Singh <
>>> rahul.xavier.si...@gmail.com> wrote:
>>>
>>>> You don’t have to use EC2 snitch on AWS but if you have already started
>>>> with it , it may put a node in a different DC.
>>>>
>>>> If your data density won’t be ridiculous You could add 3 to different
>>>> DC/ Region and then sync up. After the new DC is operational you can remove
>>>> one at a time on the old DC and at the same time add to the new one.
>>>>
>>>> Rahul
>>>> On Jun 28, 2018, 9:03 AM -0500, Randy Lynn , wrote:
>>>>
>>>> I have a 6-node cluster I'm migrating to the new i3 types.
>>>> But at the same time I want to migrate to a different AZ.
>>>>
>>>> What happens if I do the "running node replace method" with 1 node at a
>>>> time moving to the new AZ. Meaning, I'll have temporarily;
>>>>
>>>> 5 nodes in AZ 1c
>>>> 1 new node in AZ 1e.
>>>>
>>>> I'll wash-rinse-repeat till all 6 are on the new machine type and in
>>>> the new AZ.
>>>>
>>>> Any thoughts about whether this gets weird with the Ec2Snitch and a RF
>>>> 3?
>>>>
>>>> --
>>>> Randy Lynn
>>>> rl...@getavail.com
>>>>
>>>> office:
>>>> 859.963.1616 <+1-859-963-1616> ext 202
>>>> 163 East Main Street - Lexington, KY 40507 - USA
>>>>
>>>> <https://www.getavail.com/> getavail.com <https://www.getavail.com/>
>>>>
>>>>
>>>
>>>
>>>
>>>
>
>
>


Re: C* in multiple AWS AZ's

2018-06-28 Thread kurt greaves
There is a need for a repair with both DCs as rebuild will not stream all
replicas, so unless you can guarantee you were perfectly consistent at time
of rebuild you'll want to do a repair after rebuild.

On another note you could just replace the nodes but use GPFS instead of
EC2 snitch, using the same rack name.

On Fri., 29 Jun. 2018, 00:19 Rahul Singh, 
wrote:

> Parallel load is the best approach and then switch your Data access code
> to only access the new hardware. After you verify that there are no local
> read / writes on the OLD dc and that the updates are only via Gossip, then
> go ahead and change the replication factor on the key space to have zero
> replicas in the old DC. Then you can decommissioned.
>
> This way you are hundred percent sure that you aren’t missing any new
> data. No need for a DC to DC repair but a repair is always healthy.
>
> Rahul
> On Jun 28, 2018, 9:15 AM -0500, Randy Lynn , wrote:
>
> Already running with Ec2.
>
> My original thought was a new DC parallel to the current, and then
> decommission the other DC.
>
> Also my data load is small right now.. I know small is relative term..
> each node is carrying about 6GB..
>
> So given the data size, would you go with parallel DC or let the new AZ
> carry a heavy load until the others are migrated over?
> and then I think "repair" to cleanup the replications?
>
>
> On Thu, Jun 28, 2018 at 10:09 AM, Rahul Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> You don’t have to use EC2 snitch on AWS but if you have already started
>> with it , it may put a node in a different DC.
>>
>> If your data density won’t be ridiculous You could add 3 to different DC/
>> Region and then sync up. After the new DC is operational you can remove one
>> at a time on the old DC and at the same time add to the new one.
>>
>> Rahul
>> On Jun 28, 2018, 9:03 AM -0500, Randy Lynn , wrote:
>>
>> I have a 6-node cluster I'm migrating to the new i3 types.
>> But at the same time I want to migrate to a different AZ.
>>
>> What happens if I do the "running node replace method" with 1 node at a
>> time moving to the new AZ. Meaning, I'll have temporarily;
>>
>> 5 nodes in AZ 1c
>> 1 new node in AZ 1e.
>>
>> I'll wash-rinse-repeat till all 6 are on the new machine type and in the
>> new AZ.
>>
>> Any thoughts about whether this gets weird with the Ec2Snitch and a RF 3?
>>
>>
>>
>
>
>
>


Re: Re: Re: stream failed when bootstrap

2018-06-28 Thread kurt greaves
Yeah, but you only really need to drain and restart Cassandra one node at a
time. Not that the other steps will hurt, but they aren't strictly necessary.
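If it helps, a quick way to verify the schemas have converged after the rolling restart is to count the schema versions reported by describecluster — a rough sketch, assuming the output format you pasted earlier:

```shell
# Count distinct schema versions in `nodetool describecluster` output.
# Anything other than 1 means the cluster still disagrees on schema.
count_schema_versions() {
  grep -cE '^[[:space:]]*[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:' -
}
```

Run it as `nodetool describecluster | count_schema_versions`.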

On 28 June 2018 at 05:38, dayu  wrote:

> Hi kurt, a rolling restart means run disablebinary, disablethrift, 
> disablegossip, drain,
> stop cassandra and start cassandra command one by one, right?
> Only one node is executed at a time
>
> Dayu
>
>
>
> At 2018-06-28 11:37:43, "kurt greaves"  wrote:
>
> Best off trying a rolling restart.
>
> On 28 June 2018 at 03:18, dayu  wrote:
>
>> the output of nodetool describecluster
>> Cluster Information:
>> Name: online-xxx
>> Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
>> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>> Schema versions:
>> c3f00d61-1ad7-3702-8703-af2a29e401c1: [10.136.71.43]
>>
>> 0568e8c1-48ba-3fb0-bb3c-462438978d7b: [10.136.71.33, ]
>>
>> after I run nodetool resetlocalschema, this error is logged:
>>
>> ERROR [InternalResponseStage:209417] 2018-06-28 11:14:12,904 MigrationTask.java:96 - Configuration exception merging remote schema
>> org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 5552bba0-2dc6-11e8-9b5c-254242d97235; expected 53f6d520-2dc6-11e8-948d-ab7caa3c8c36)
>> at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:790) ~[apache-cassandra-3.0.10.jar:3.0.10]
>> at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:750) ~[apache-cassandra-3.0.10.jar:3.0.10]
>> at org.apache.cassandra.config.Schema.updateTable(Schema.java:661) ~[apache-cassandra-3.0.10.jar:3.0.10]
>> at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1348) ~[apache-cassandra-3.0.10.jar:3.0.10]
>>
>>
>>
>>
>>
>> At 2018-06-28 10:01:52, "Jeff Jirsa"  wrote:
>>
>> You can sometimes bounce your way through it (or use nodetool
>> resetlocalschema if it’s a single node that’s wrong), but there are some
>> edge cases from which it’s very hard to recover
>>
>> What’s the output of nodetool describecluster?
>>
>> If you select from the schema tables, do you see that CFID on any real
>> tables?
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Jun 27, 2018, at 7:58 PM, dayu  wrote:
>>
>> That sound reasonable, I have seen schema mismatch error before.
>> So any advise to deal with schema mismatches?
>>
>> Dayu
>>
>> At 2018-06-28 09:50:37, "Jeff Jirsa"  wrote:
>> >That log message says you did:
>> >
>> > CF 53f6d520-2dc6-11e8-948d-ab7caa3c8c36 was dropped during streaming
>> >
>> >If you’re absolutely sure you didn’t, you should look for schema mismatches 
>> >in your cluster
>> >
>> >
>> >--
>> >Jeff Jirsa
>> >
>> >
>> >> On Jun 27, 2018, at 7:49 PM, dayu  wrote:
>> >>
>> >> CF 53f6d520-2dc6-11e8-948d-ab7caa3c8c36 was dropped during streaming
>> >
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
>
>


Re: Re: stream failed when bootstrap

2018-06-27 Thread kurt greaves
Best off trying a rolling restart.

On 28 June 2018 at 03:18, dayu  wrote:

> the output of nodetool describecluster
> Cluster Information:
> Name: online-xxx
> Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> Schema versions:
> c3f00d61-1ad7-3702-8703-af2a29e401c1: [10.136.71.43]
>
> 0568e8c1-48ba-3fb0-bb3c-462438978d7b: [10.136.71.33, ]
>
> after I run nodetool resetlocalschema, this error is logged:
>
> ERROR [InternalResponseStage:209417] 2018-06-28 11:14:12,904 MigrationTask.java:96 - Configuration exception merging remote schema
> org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 5552bba0-2dc6-11e8-9b5c-254242d97235; expected 53f6d520-2dc6-11e8-948d-ab7caa3c8c36)
> at org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:790) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:750) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.config.Schema.updateTable(Schema.java:661) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1348) ~[apache-cassandra-3.0.10.jar:3.0.10]
>
>
>
>
>
> At 2018-06-28 10:01:52, "Jeff Jirsa"  wrote:
>
> You can sometimes bounce your way through it (or use nodetool
> resetlocalschema if it’s a single node that’s wrong), but there are some
> edge cases from which it’s very hard to recover
>
> What’s the output of nodetool describecluster?
>
> If you select from the schema tables, do you see that CFID on any real
> tables?
>
> --
> Jeff Jirsa
>
>
> On Jun 27, 2018, at 7:58 PM, dayu  wrote:
>
> That sound reasonable, I have seen schema mismatch error before.
> So any advise to deal with schema mismatches?
>
> Dayu
>
> At 2018-06-28 09:50:37, "Jeff Jirsa"  wrote:
> >That log message says you did:
> >
> > CF 53f6d520-2dc6-11e8-948d-ab7caa3c8c36 was dropped during streaming
> >
> >If you’re absolutely sure you didn’t, you should look for schema mismatches 
> >in your cluster
> >
> >
> >--
> >Jeff Jirsa
> >
> >
> >> On Jun 27, 2018, at 7:49 PM, dayu  wrote:
> >>
> >> CF 53f6d520-2dc6-11e8-948d-ab7caa3c8c36 was dropped during streaming
> >
>
>
>
>
>
>
>
>
>


Re: Is it ok to add more than one node to a exist cluster

2018-06-27 Thread kurt greaves
To clarify, this is after the node reaches NORMAL state. Don't bootstrap
other nodes while one is in JOINING unless you really know what you are
doing and aren't using vnodes.
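If you want to script the "wait for NORMAL" part, something like this can gate each bootstrap (assumes `nodetool status` output where joining nodes show as UJ):

```shell
# Succeeds only when no node in the `nodetool status` output piped in
# is still joining (status lines starting with "UJ").
no_joining_nodes() {
  ! grep -q '^UJ' -
}
```

e.g. `nodetool status | no_joining_nodes && echo "safe to add the next node"`.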

On Wed., 27 Jun. 2018, 21:02 Abdul Patel,  wrote:

> There's always a 2-minute rule: after adding one node, wait for 2 minutes
> before adding the second node.
>
> On Wednesday, June 27, 2018, dayu  wrote:
>
>> Thanks for your reply, kurt.
>>
>> another question, Can I bootstrap a new node when some node is in Joining
>> state ? Or I should wait until Joining node becoming Normal ?
>>
>> Dayu
>>
>>
>>
>> At 2018-06-27 17:50:34, "kurt greaves"  wrote:
>>
>> Don't bootstrap nodes simultaneously unless you really know what you're
>> doing, and you're using single tokens. It's not straightforward and will
>> likely lead to data loss/inconsistencies. This applies for all current
>> versions.
>>
>> On 27 June 2018 at 10:21, dayu  wrote:
>>
>>> Hi,
>>> I have read a warning against simultaneously bootstrapping more than
>>> one new node from the same rack in version 2.1  link
>>> <https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_node_to_cluster_t.html>
>>> My cassandra cluster version is 3.0.10.
>>> So I'd like to know: is it OK to add more than one node to an existing
>>> cluster in 3.0.10?
>>>
>>> Thanks!
>>> Dayu
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>


Re: Is it ok to add more than one node to a exist cluster

2018-06-27 Thread kurt greaves
Don't bootstrap nodes simultaneously unless you really know what you're
doing, and you're using single tokens. It's not straightforward and will
likely lead to data loss/inconsistencies. This applies for all current
versions.

On 27 June 2018 at 10:21, dayu  wrote:

> Hi,
> I have read a warning against simultaneously bootstrapping more than
> one new node from the same rack in version 2.1  link
> 
> My cassandra cluster version is 3.0.10.
> So I'd like to know: is it OK to add more than one node to an existing
> cluster in 3.0.10?
>
> Thanks!
> Dayu
>
>
>
>


Re: 3.11.2 memory leak

2018-06-19 Thread kurt greaves
At this point I'd wait for 3.11.3. If you can't, you can get away with
backporting a few repair fixes or just doing subrange repairs on 3.11.2.

On Wed., 20 Jun. 2018, 01:10 Abdul Patel,  wrote:

> Hi All,
>
> Do we kmow whats the stable version for now if u wish to upgrade ?
>
> On Tuesday, June 5, 2018, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Jeff,
>>
>>
>>
>> FWIW, when talking about
>> https://issues.apache.org/jira/browse/CASSANDRA-13929, there is a patch
>> available since March without getting further attention.
>>
>>
>>
>> Regards,
>>
>> Thomas
>>
>>
>>
>> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
>> *Sent:* Dienstag, 05. Juni 2018 00:51
>> *To:* cassandra 
>> *Subject:* Re: 3.11.2 memory leak
>>
>>
>>
>> There have been a few people who have reported it, but nobody (yet) has
>> offered a patch to fix it. It would be good to have a reliable way to
>> repro, and/or an analysis of a heap dump demonstrating the problem (what's
>> actually retained at the time you're OOM'ing).
>>
>>
>>
>> On Mon, Jun 4, 2018 at 6:52 AM, Abdul Patel  wrote:
>>
>> Hi All,
>>
>>
>>
>> I recently upgraded my non prod cluster from 3.10 to 3.11.2.
>>
>> It was working fine for a 1.5 weeks then suddenly nodetool info startee
>> reporting 80% and more memory consumption.
>>
>> Intially it was 16gb configured, then i bumped to 20gb and rebooted all 4
>> nodes of cluster-single DC.
>>
>> Now after 8 days i again see 80% + usage and its 16gb and above ..which
>> we never saw before .
>>
>> Seems like memory leak bug?
>>
>> Does anyone has any idea ? Our 3.11.2 release rollout has been halted
>> because of this.
>>
>> If not 3.11.2 whats the next best stable release we have now?
>>
>>
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313
>>
>


Re: Timestamp on hints file and system.hints table data

2018-06-18 Thread kurt greaves
Send through some examples (and any errors)? Sounds like the file might be
corrupt. Not that there's much you can do about that. You can try stopping
C*, deleting the file, then starting C* again. You'll have to repair,
assuming you haven't repaired already since that hint file was created.

On 18 June 2018 at 13:56, learner dba 
wrote:

> Yes Kurt, system log is flooded with hints sent and replayed messages.
>
> On Monday, June 18, 2018, 7:30:34 AM EDT, kurt greaves <
> k...@instaclustr.com> wrote:
>
>
> Not sure what to make of that. Are there any log messages regarding the
> file and replaying hints? Sounds like maybe it's corrupt (although not sure
> why it keeps getting rewritten).
>
> On 14 June 2018 at 13:19, Nitan Kainth  wrote:
>
> Kurt,
>
> The hint file's UUID matches another node in the cluster:
>
> -rw-r--r--. 1 root root  6848246 May 13 23:37 1b694180-210a-4b75-8f2a-748f4a5b6a3d-1526254645089-1.hints
>
> /opt/cassandra/bin/nodetool status | grep 1b694180
>
> UN  x.x.x.   23.77 GiB  256  ?   1b694180-210a-4b75-8f2a-748f4a5b6a3d  RAC1
>
>
>
> On Thu, Jun 14, 2018 at 12:45 AM, kurt greaves 
> wrote:
>
> Does the UUID on the filename correspond with a UUID in nodetool status?
>
> Sounds to me like it could be something weird with an old node that no
> longer exists, although hints for old nodes are meant to be cleaned up.
>
> On 14 June 2018 at 01:54, Nitan Kainth  wrote:
>
> Kurt,
>
> No node is down for months. And yes, I am surprised to look at Unix
> timestamp on files.
>
>
>
> On Jun 13, 2018, at 6:41 PM, kurt greaves  wrote:
>
> system.hints is not used in Cassandra 3. Can't explain the files though,
> are you referring to the files timestamp or the Unix timestamp in the file
> name? Is there a node that's been down for several months?
>
> On Wed., 13 Jun. 2018, 23:41 Nitan Kainth,  wrote:
>
> Hi,
>
> I observed a strange behavior about stored hints.
>
> Time stamp of hints file shows several months old. I deleted them and saw
> new hints files created with same old date. Why is that?
>
> Also, I see hints files on disk but if I query system.hints table, it
> shows 0 rows. Why system.hints is not populated?
>
> Version 3.11-1
>
>
>
>
>


Re: Timestamp on hints file and system.hints table data

2018-06-18 Thread kurt greaves
Not sure what to make of that. Are there any log messages regarding the
file and replaying hints? Sounds like maybe it's corrupt (although not sure
why it keeps getting rewritten).

On 14 June 2018 at 13:19, Nitan Kainth  wrote:

> Kurt,
>
> The hint file's UUID matches another node in the cluster:
>
> -rw-r--r--. 1 root root  6848246 May 13 23:37 1b694180-210a-4b75-8f2a-748f4a5b6a3d-1526254645089-1.hints
>
> /opt/cassandra/bin/nodetool status | grep 1b694180
>
> UN  x.x.x.   23.77 GiB  256  ?   1b694180-210a-4b75-8f2a-748f4a5b6a3d  RAC1
>
>
>
> On Thu, Jun 14, 2018 at 12:45 AM, kurt greaves 
> wrote:
>
>> Does the UUID on the filename correspond with a UUID in nodetool status?
>>
>> Sounds to me like it could be something weird with an old node that no
>> longer exists, although hints for old nodes are meant to be cleaned up.
>>
>> On 14 June 2018 at 01:54, Nitan Kainth  wrote:
>>
>>> Kurt,
>>>
>>> No node is down for months. And yes, I am surprised to look at Unix
>>> timestamp on files.
>>>
>>>
>>>
>>> On Jun 13, 2018, at 6:41 PM, kurt greaves  wrote:
>>>
>>> system.hints is not used in Cassandra 3. Can't explain the files though,
>>> are you referring to the files timestamp or the Unix timestamp in the file
>>> name? Is there a node that's been down for several months?
>>>
>>> On Wed., 13 Jun. 2018, 23:41 Nitan Kainth, 
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I observed a strange behavior about stored hints.
>>>>
>>>> Time stamp of hints file shows several months old. I deleted them and
>>>> saw new hints files created with same old date. Why is that?
>>>>
>>>> Also, I see hints files on disk but if I query system.hints table, it
>>>> shows 0 rows. Why system.hints is not populated?
>>>>
>>>> Version 3.11-1
>>>>
>>>
>>
>


Re:

2018-06-18 Thread kurt greaves
>
> 1) Am I correct to assume that the larger page size some user session has
> set - the larger portion of cluster/coordinator node resources will be
> hogged by the corresponding session?
> 2) Do I understand correctly that page size (imagine we have no timeout
> settings) is limited by RAM and iops which I want to hand down to a single
> user session?

Yes for both of the above. More rows will be pulled into memory
simultaneously with a larger page size, thus using more memory and IO.

> 3) Am I correct to assume that the page size/read request timeout allowance
> I set is direct representation of chance to lock some node to single user's
> requests?

Concurrent reads can occur on a node, so it shouldn't "lock" the node to a
single user's requests. However, you can overload the node, which may be
effectively the same thing. Don't set page sizes too high, otherwise the
coordinator of the query will end up doing a lot of GC.
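As a very rough back-of-the-envelope: the coordinator materialises about page size × average row size bytes per page, per session (the row size below is a made-up figure, not measured):

```shell
# Approximate bytes held by the coordinator for one page of results:
# rows per page multiplied by the average serialized row size.
page_bytes() {
  echo $(( $1 * $2 ))
}
```

So `page_bytes 5000 2048` gives 10240000 (~10 MB) for a 5000-row page of ~2 KB rows, versus ~2 MB for a 1000-row page of the same rows.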


Re: Compaction strategy for update heavy workload

2018-06-13 Thread kurt greaves
>
> I wouldn't use TWCS if there's updates, you're going to risk having
> data that's never deleted and really small sstables sticking around
> forever.

How do you risk having data sticking around forever when everything is
TTL'd?

> If you use really large buckets, what's the point of TWCS?

No one said anything about really large buckets. I'd also note that if the
data was so small per partition it would be entirely reasonable to not
bucket by partition key (and window) and thus updates would become
irrelevant.

Honestly this is such a small workload you could easily use STCS or
> LCS and you'd likely never, ever see a problem.


While the numbers sound small, there must be some logical reason to have so
many nodes. In my experience STCS and LCS both have their own drawbacks in
regards to updates, more so when you have high data density, which sounds
like it might be the case here. It's not hard to test these things and it's
important to get these things right at the start to save yourself some
serious pain down the track.

On 13 June 2018 at 22:41, Jonathan Haddad  wrote:

> I wouldn't use TWCS if there's updates, you're going to risk having
> data that's never deleted and really small sstables sticking around
> forever.  If you use really large buckets, what's the point of TWCS?
>
> Honestly this is such a small workload you could easily use STCS or
> LCS and you'd likely never, ever see a problem.
> On Wed, Jun 13, 2018 at 3:34 PM kurt greaves  wrote:
> >
> > TWCS is probably still worth trying. If you mean updating old rows in
> TWCS "out of order updates" will only really mean you'll hit more SSTables
> on read. This might add a bit of complexity in your client if your
> bucketing partitions (not strictly necessary), but that's about it. As long
> as you're not specifying "USING TIMESTAMP" you still get the main benefit
> of efficient dropping of SSTables - C* only cares about the write timestamp
> of the data in regards to TTL's, not timestamps stored in your
> partition/clustering key.
> > Also keep in mind that you can specify the window size in TWCS, so if
> you can increase it enough to cover the "out of order" updates then that
> will also solve the problem w.r.t old buckets.
> >
> > In regards to LCS, the only way to really know if it'll be too much
> compaction overhead is to test it, but for the most part you should
> consider your read/write ratio, rather than the total number of
> reads/writes (unless it's so small that it's irrelevant, which it may well
> be).
> >
> > On 13 June 2018 at 19:25, manuj singh  wrote:
> >>
> >> Hi all,
> >> I am trying to determine compaction strategy for our use case.
> >> In our use case we will have updates on a row a few times. And we have
> a ttl also defined on the table level.
> >> Our typical workload is less then 1000 writes + reads per second. At
> the max it could go up to 2500 per second.
> >> We use SSD and have around 64 gb of ram on each node. Our cluster size
> is around 70 nodes.
> >>
> >> I looked at time series but we cant guarantee that the updates will
> happen within a give time window. And if we have out of order updates it
> might impact on when we remove that data from the disk.
> >>
> >> So i was looking at level tiered, which supposedly is good when you
> have updates. However its io bound and will affect the writes. everywhere i
> read it says its not good for write heavy workload.
> >> But Looking at our write velocity, is it really write heavy ?
> >>
> >> I guess what i am trying to find out is will level tiered compaction
> will impact the writes in our use case or it will be fine given our write
> rate is not that much.
> >> Also is there anything else i should keep in mind while deciding on the
> compaction strategy.
> >>
> >> Thanks!!
> >
> >
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> twitter: rustyrazorblade
>
>
>


Re: Timestamp on hints file and system.hints table data

2018-06-13 Thread kurt greaves
Does the UUID on the filename correspond with a UUID in nodetool status?

Sounds to me like it could be something weird with an old node that no
longer exists, although hints for old nodes are meant to be cleaned up.
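In case it's useful: a hints file name starts with the 36-character host ID of the target node, so you can pull it out and grep `nodetool status` for it — a small sketch (the hints directory path is the usual default, adjust for your install):

```shell
# Extract the target host ID from a hints file name, e.g.
#   1b694180-210a-4b75-8f2a-748f4a5b6a3d-1526254645089-1.hints
# The first 36 characters are the host ID; the rest is a timestamp and sequence.
hint_host_id() {
  basename "$1" | cut -c1-36
}
```

e.g. `nodetool status | grep "$(hint_host_id /var/lib/cassandra/hints/somefile.hints)"`.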

On 14 June 2018 at 01:54, Nitan Kainth  wrote:

> Kurt,
>
> No node is down for months. And yes, I am surprised to look at Unix
> timestamp on files.
>
>
>
> On Jun 13, 2018, at 6:41 PM, kurt greaves  wrote:
>
> system.hints is not used in Cassandra 3. Can't explain the files though,
> are you referring to the files timestamp or the Unix timestamp in the file
> name? Is there a node that's been down for several months?
>
> On Wed., 13 Jun. 2018, 23:41 Nitan Kainth,  wrote:
>
>> Hi,
>>
>> I observed a strange behavior about stored hints.
>>
>> Time stamp of hints file shows several months old. I deleted them and saw
>> new hints files created with same old date. Why is that?
>>
>> Also, I see hints files on disk but if I query system.hints table, it
>> shows 0 rows. Why system.hints is not populated?
>>
>> Version 3.11-1
>>
>


Re: Timestamp on hints file and system.hints table data

2018-06-13 Thread kurt greaves
system.hints is not used in Cassandra 3. Can't explain the files though:
are you referring to the file's timestamp or the Unix timestamp in the file
name? Is there a node that's been down for several months?

On Wed., 13 Jun. 2018, 23:41 Nitan Kainth,  wrote:

> Hi,
>
> I observed a strange behavior about stored hints.
>
> Time stamp of hints file shows several months old. I deleted them and saw
> new hints files created with same old date. Why is that?
>
> Also, I see hints files on disk but if I query system.hints table, it
> shows 0 rows. Why system.hints is not populated?
>
> Version 3.11-1
>


Re: Compaction strategy for update heavy workload

2018-06-13 Thread kurt greaves
TWCS is probably still worth trying. If by "out of order updates" you mean
updating old rows, in TWCS that will only really mean you'll hit more
SSTables on read. This might add a bit of complexity in your client if
you're bucketing your partitions (not strictly necessary), but that's about
it. As long as you're not specifying "USING TIMESTAMP" you still get the
main benefit of efficient dropping of SSTables - C* only cares about the
*write timestamp* of the data in regard to TTLs, not timestamps stored in
your partition/clustering key.
Also keep in mind that you can specify the window size in TWCS, so if you
can increase it enough to cover the "out of order" updates then that will
also solve the problem w.r.t old buckets.

In regards to LCS, the only way to really know if it'll be too much
compaction overhead is to test it, but for the most part you should
consider your read/write ratio, rather than the total number of
reads/writes (unless it's so small that it's irrelevant, which it may well
be).

On 13 June 2018 at 19:25, manuj singh  wrote:

> Hi all,
> I am trying to determine compaction strategy for our use case.
> In our use case we will have updates on a row a few times. And we have a
> ttl also defined on the table level.
> Our typical workload is less then 1000 writes + reads per second. At the
> max it could go up to 2500 per second.
> We use SSD and have around 64 gb of ram on each node. Our cluster size is
> around 70 nodes.
>
> I looked at time series but we cant guarantee that the updates will happen
> within a give time window. And if we have out of order updates it might
> impact on when we remove that data from the disk.
>
> So i was looking at level tiered, which supposedly is good when you have
> updates. However its io bound and will affect the writes. everywhere i read
> it says its not good for write heavy workload.
> But Looking at our write velocity, is it really write heavy ?
>
> I guess what i am trying to find out is will level tiered compaction will
> impact the writes in our use case or it will be fine given our write rate
> is not that much.
> Also is there anything else i should keep in mind while deciding on the
> compaction strategy.
>
> Thanks!!
>


Re: Migrating to Reaper: Switching From Incremental to Reaper's Full Subrange Repair

2018-06-13 Thread kurt greaves
Not strictly necessary but probably a good idea as you don't want two
separate pools of SSTables unnecessarily. Also if you've set
"only_purge_repaired_tombstones" you'll need to turn that off.

On Wed., 13 Jun. 2018, 23:06 Fd Habash,  wrote:

> For those who are using Reaper …
>
>
>
> Currently, I'm run repairs using crontab/nodetool using 'repair -pr' on
> 2.2.8 which defaults to incremental. If I migrate to Reaper, do I have to
> mark sstables as un-repaired first? Also, out of the box, does Reaper run
> full parallel repair? If yes, is it not going to cause over-streaming since
> we are repairing ranges multiple times?
>
>
>
> 
> Thank you
>
>
>


Re: Cassandra 3.0.X migarte to VPC

2018-06-07 Thread kurt greaves
>
> I meant migrating to GossipingPropertyFileSnitch while adding the new DC.
> The new DC will be empty, so all the data will be streamed based on the
> snitch property chosen


Should work fine on the new DC, as long as the original DC is using a
snitch that supports datacenters - then just don't mix and match snitches
within a DC.


Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage

2018-05-29 Thread kurt greaves
Good to know. So that confirms it's just the GC threads causing problems.

On Tue., 29 May 2018, 22:02 Steinmaurer, Thomas, <
thomas.steinmau...@dynatrace.com> wrote:

> Kurt,
>
>
>
> in our test it also didn’t made a difference with the default number of GC
> Threads (43 on our large machine) and running with Xmx128M or XmX31G
> (derived from $MAX_HEAP_SIZE). For both Xmx, we saw the high CPU caused by
> nodetool.
>
>
>
> Regards,
>
> Thomas
>
>
>
> *From:* kurt greaves [mailto:k...@instaclustr.com]
> *Sent:* Dienstag, 29. Mai 2018 13:06
> *To:* User 
> *Subject:* Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>
>
>
> Thanks Thomas. After a bit more research today I found that the whole
> $MAX_HEAP_SIZE issue isn't really a problem because we don't explicitly set
> -Xms so the minimum heapsize by default will be 256mb, which isn't hugely
> problematic, and it's unlikely more than that would get allocated.
>
>
>
> On 29 May 2018 at 09:29, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hi Kurt,
>
>
>
> thanks for pointing me to the Xmx issue.
>
>
>
> JIRA + patch (for Linux only based on C* 3.11) for the parallel GC thread
> issue is available here:
> https://issues.apache.org/jira/browse/CASSANDRA-14475
>
>
>
> Thanks,
>
> Thomas
>
>
>
> *From:* kurt greaves [mailto:k...@instaclustr.com]
> *Sent:* Dienstag, 29. Mai 2018 05:54
> *To:* User 
> *Subject:* Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>
>
>
> 1) nodetool is reusing the $MAX_HEAP_SIZE environment variable, thus if we
> are running Cassandra with e.g. Xmx31G, nodetool is started with Xmx31G as
> well
>
> This was fixed in 3.0.11/3.10 in CASSANDRA-12739
> <https://issues.apache.org/jira/browse/CASSANDRA-12739>. Not sure why it
> didn't make it into 2.1/2.2.
>
> 2) As -XX:ParallelGCThreads is not explicitly set upon startup, this
> basically defaults to a value dependent on the number of cores. In our
> case, with the machine above, the number of parallel GC threads for the JVM
> is set to 43!
> 3) Test-wise, we have adapted the nodetool startup script in a way to get
> a Java Flight Recording file on JVM exit, thus with each nodetool
> invocation we can inspect a JFR file. Here we may have seen System.gc()
> calls (without visible knowledge where they come from), GC times for the
> entire JVM life-time (e.g. ~1min) showing high cpu. This happened for both
> Xmx128M (default as it seems) and Xmx31G
>
> After explicitly setting -XX:ParallelGCThreads=1 in the nodetool startup
> script, CPU usage spikes by nodetool are entirely gone.
>
> Is this something which has been already adapted/tackled in Cassandra
> versions > 2.1 or worth to be considered as some sort of RFC?
>
> Can you create a JIRA for this (and a patch, if you like)? We should be
> explicitly setting this on nodetool invocations.
>
> ​
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
>


Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage

2018-05-29 Thread kurt greaves
Thanks Thomas. After a bit more research today I found that the whole
$MAX_HEAP_SIZE issue isn't really a problem because we don't explicitly set
-Xms so the minimum heapsize by default will be 256mb, which isn't hugely
problematic, and it's unlikely more than that would get allocated.

On 29 May 2018 at 09:29, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hi Kurt,
>
>
>
> thanks for pointing me to the Xmx issue.
>
>
>
> JIRA + patch (for Linux only based on C* 3.11) for the parallel GC thread
> issue is available here: https://issues.apache.org/jira/browse/CASSANDRA-14475
>
>
>
> Thanks,
>
> Thomas
>
>
>
> *From:* kurt greaves [mailto:k...@instaclustr.com]
> *Sent:* Dienstag, 29. Mai 2018 05:54
> *To:* User 
> *Subject:* Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage
>
>
>
> 1) nodetool is reusing the $MAX_HEAP_SIZE environment variable, thus if we
> are running Cassandra with e.g. Xmx31G, nodetool is started with Xmx31G as
> well
>
> This was fixed in 3.0.11/3.10 in CASSANDRA-12739
> <https://issues.apache.org/jira/browse/CASSANDRA-12739>. Not sure why it
> didn't make it into 2.1/2.2.
>
> 2) As -XX:ParallelGCThreads is not explicitly set upon startup, this
> basically defaults to a value dependent on the number of cores. In our
> case, with the machine above, the number of parallel GC threads for the JVM
> is set to 43!
> 3) Test-wise, we have adapted the nodetool startup script in a way to get
> a Java Flight Recording file on JVM exit, thus with each nodetool
> invocation we can inspect a JFR file. Here we may have seen System.gc()
> calls (without visible knowledge where they come from), GC times for the
> entire JVM life-time (e.g. ~1min) showing high cpu. This happened for both
> Xmx128M (default as it seems) and Xmx31G
>
> After explicitly setting -XX:ParallelGCThreads=1 in the nodetool startup
> script, CPU usage spikes by nodetool are entirely gone.
>
> Is this something which has been already adapted/tackled in Cassandra
> versions > 2.1 or worth to be considered as some sort of RFC?
>
> Can you create a JIRA for this (and a patch, if you like)? We should be
> explicitly setting this on nodetool invocations.
>
> ​
>


Re: nodetool (2.1.18) - Xmx, ParallelGCThreads, High CPU usage

2018-05-28 Thread kurt greaves
>
> 1) nodetool is reusing the $MAX_HEAP_SIZE environment variable, thus if we
> are running Cassandra with e.g. Xmx31G, nodetool is started with Xmx31G as
> well

This was fixed in 3.0.11/3.10 in CASSANDRA-12739. Not sure why it
didn't make it into 2.1/2.2.

> 2) As -XX:ParallelGCThreads is not explicitly set upon startup, this
> basically defaults to a value dependent on the number of cores. In our
> case, with the machine above, the number of parallel GC threads for the JVM
> is set to 43!
> 3) Test-wise, we have adapted the nodetool startup script in a way to get
> a Java Flight Recording file on JVM exit, thus with each nodetool
> invocation we can inspect a JFR file. Here we may have seen System.gc()
> calls (without visible knowledge where they come from), GC times for the
> entire JVM life-time (e.g. ~1min) showing high cpu. This happened for both
> Xmx128M (default as it seems) and Xmx31G
>
> After explicitly setting -XX:ParallelGCThreads=1 in the nodetool startup
> script, CPU usage spikes by nodetool are entirely gone.
>
> Is this something which has been already adapted/tackled in Cassandra
> versions > 2.1 or worth to be considered as some sort of RFC?
>
Can you create a JIRA for this (and a patch, if you like)? We should be
explicitly setting this on nodetool invocations.
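For reference, HotSpot's default sizing heuristic can be sketched as follows (an approximation of the JVM's behaviour, not lifted from its source) — a 64-vCPU box lands on the 43 threads reported earlier in this thread:

```python
def default_parallel_gc_threads(ncpus):
    # HotSpot's rough default: all CPUs up to 8, then 5/8 of each extra CPU.
    if ncpus <= 8:
        return ncpus
    return 8 + (5 * (ncpus - 8)) // 8

print(default_parallel_gc_threads(64))  # -> 43
print(default_parallel_gc_threads(8))   # -> 8
```

Passing an explicit -XX:ParallelGCThreads=1 to short-lived tools like nodetool sidesteps this heuristic entirely.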
​


Re: performance on reading only the specific nonPk column

2018-05-21 Thread kurt greaves
Every column that's populated will be retrieved from disk, and the
requested column will then be sliced out in memory and sent back.
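A toy model of that read path (hypothetical names, purely illustrative): the whole populated row comes off disk, and the projection to the requested column happens in memory afterwards.

```python
# Toy model: one partition row with 100 metric columns on disk.
row_on_disk = {"m%d" % i: i for i in range(1, 101)}

def read_columns(requested):
    fetched = dict(row_on_disk)  # the whole populated row is read from disk
    return {c: fetched[c] for c in requested}  # sliced out in memory

print(read_columns(["m1"]))  # only m1 returned, but all 100 cells were fetched
```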

On 21 May 2018 at 08:34, sujeet jog  wrote:

> Folks,
>
> consider a table with 100 metrics with (id , timestamp ) as key,
> if one wants to do a selective metric read
>
> select m1 from table where id = 10 and timestamp >= '2017-01-02
> :00:00:00'
> and timestamp <= '2017-01-02 04:00:00'
>
> does the read on the specific node first bring in all the metrics
> m1 - m100, with the metric then sliced in memory and retrieved, or does the
> disk read happen only on the sliced data m1 without bringing in m1 - m100?
>
> here the partition & clustering key are provided in the query; the question
> is more about the efficiency of read operations on this schema.
>
> create table {
> id : Int,
> timestamp : timestamp ,
> m1 : Int,
> m2  : Int,
> m3  Int,
> m4  Int,
> ..
> ..
> m100 : Int
>
> Primary Key ( id, timestamp )
> }
>
> Thanks
>


Re: Invalid metadata has been detected for role

2018-05-17 Thread kurt greaves
Can you post the stack trace and your version of Cassandra?

On Fri., 18 May 2018, 09:48 Abdul Patel,  wrote:

> Hi
>
> I had to decommission one DC. Now, while adding back the same nodes (I
> used nodetool decommission), they both get added fine and I also see them
> in nodetool status, but I am unable to log in to them. It gives an invalid
> metadata error. I ran repair and later cleanup as well.
>
> Any ideas?
>
>


Re: row level atomicity and isolation

2018-05-16 Thread kurt greaves
Atomicity and isolation are only guaranteed within a replica. If you have
multiple concurrent requests across replicas, the last timestamp will win. You
can get better isolation using LWT, which uses Paxos under the hood.
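The last-timestamp-wins reconciliation can be sketched per cell like this (illustrative only; the tie-break on value matches Cassandra's behaviour for identical timestamps):

```python
def reconcile(a, b):
    # a, b are (timestamp, value) pairs for the same cell on two replicas.
    # Tuple comparison gives: higher timestamp wins, value breaks exact ties.
    return max(a, b)

print(reconcile((1000, "alice"), (1001, "bob")))  # -> (1001, 'bob')
print(reconcile((1000, "alice"), (1000, "zed")))  # tie -> (1000, 'zed')
```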

On 16 May 2018 at 08:55, Rajesh Kishore  wrote:

> Hi,
>
> I am just curious to know when Cassandra doc says the atomicity and
> isolation is guaranteed for a row.
> Does it mean, two requests updating a row- "R1" at different replica will
> be candidate for atomicity and isolation?
>
> For instance , I have a setup where RF is 2
> I have a client application where two requests of updating a particular
> row #R1 goes to two coordinator nodes at same time.
> Row "R1" has presence in nodes - N1, N2 (since RF is 2)
> Does Cassandra ensure atomicity & isolation across replicas/partitions for
> a particular row? If so, how does it get handled: does Cassandra follow a
> two-phase commit transaction for a row, or does Cassandra use a distributed
> lock for a row?
>
> Thanks,
> Rajesh
>
>
>


Re: Suggestions for migrating data from cassandra

2018-05-15 Thread kurt greaves
COPY might work but over hundreds of gigabytes you'll probably run into
issues if you're overloaded. If you've got access to Spark that would be an
efficient way to pull down an entire table and dump it out using the
spark-cassandra-connector.

On 15 May 2018 at 10:59, Jing Meng  wrote:

> Hi guys, for some historical reasons, our cassandra cluster is currently
> overloaded and operating it has somehow become a nightmare. Anyway,
> (sadly) we're planning to migrate cassandra data back to mysql...
>
> So we're not quite clear on how to migrate the historical data from
> cassandra.
>
> While I know there is the COPY command, I wonder if it works in a production
> env where hundreds of gigabytes of data are present. And, if it does,
> would it impact server performance significantly?
>
> Apart from that, I know the spark-connector can be used to scan data from a
> c* cluster, but I'm not that familiar with spark and still not sure whether
> writing data to a mysql database can be done naturally with the spark-connector.
>
> Are there any suggestions/best-practice/read-materials doing this?
>
> Thanks!
>


Re: dtests failing with - ValueError: unsupported hash type md5

2018-05-10 Thread kurt greaves
What command did you run? Probably worth checking that cqlsh is installed
in the virtual environment and that you are executing pytest from within
the virtual env.

On 10 May 2018 at 05:06, Rajiv Dimri  wrote:

> Hi All,
>
>
>
> We have setup a dtest environment to run against Cassandra db version
> 3.11.1 and 3.0.5
>
> As per instruction on https://github.com/apache/cassandra-dtest we have
> setup the environment with python 3.6.5 along with other dependencies.
>
> The server used is Oracle RHEL (Red Hat Enterprise Linux Server release
> 6.6 (Santiago))
>
>
>
> During the runs, multiple tests are failing with the specific error
> mentioned below.
>
>
>
> process = , cmd_args =
> ['cqlsh', 'TRACING ON', None]
>
>
>
> def handle_external_tool_process(process, cmd_args):
>
> out, err = process.communicate()
>
> rc = process.returncode
>
>
>
> if rc != 0:
>
> >   raise ToolError(cmd_args, rc, out, err)
>
> E   ccmlib.node.ToolError: Subprocess ['cqlsh', 'TRACING ON',
> None] exited with non-zero status; exit status: 1;
>
> E   stderr: ERROR:root:code for hash md5 was not found.
>
> E   Traceback (most recent call last):
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 139, in 
>
> E   globals()[__func_name] = __get_hash(__func_name)
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 91, in __get_builtin_constructor
>
> E   raise ValueError('unsupported hash type ' + name)
>
> E   ValueError: unsupported hash type md5
>
> E   ERROR:root:code for hash sha1 was not found.
>
> E   Traceback (most recent call last):
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 139, in 
>
> E   globals()[__func_name] = __get_hash(__func_name)
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 91, in __get_builtin_constructor
>
> E   raise ValueError('unsupported hash type ' + name)
>
> E   ValueError: unsupported hash type sha1
>
> E   ERROR:root:code for hash sha224 was not found.
>
> E   Traceback (most recent call last):
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 139, in 
>
> E   globals()[__func_name] = __get_hash(__func_name)
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 91, in __get_builtin_constructor
>
> E   raise ValueError('unsupported hash type ' + name)
>
> E   ValueError: unsupported hash type sha224
>
> E   ERROR:root:code for hash sha256 was not found.
>
> E   Traceback (most recent call last):
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 139, in 
>
> E   globals()[__func_name] = __get_hash(__func_name)
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 91, in __get_builtin_constructor
>
> E   raise ValueError('unsupported hash type ' + name)
>
> E   ValueError: unsupported hash type sha256
>
> E   ERROR:root:code for hash sha384 was not found.
>
> E   Traceback (most recent call last):
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 139, in 
>
> E   globals()[__func_name] = __get_hash(__func_name)
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 91, in __get_builtin_constructor
>
> E   raise ValueError('unsupported hash type ' + name)
>
> E   ValueError: unsupported hash type sha384
>
> E   ERROR:root:code for hash sha512 was not found.
>
> E   Traceback (most recent call last):
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 139, in 
>
> E   globals()[__func_name] = __get_hash(__func_name)
>
> E File "/ade_autofs/ade_infra/nfsdo_
> linux.x64/PYTHON/2.7.8/LINUX.X64/141106.0120/python/lib/python2.7/hashlib.py",
> line 91, in __get_builtin_constructor
>
> E   raise ValueError('unsupported hash type ' + name)
>
> E   ValueError: unsupported hash type sha512
>
> E   Traceback (most recent call last):
>
> E File 

Re: compaction: huge number of random reads

2018-05-07 Thread kurt greaves
If you've got small partitions/small reads, you should test lowering the
compression chunk size on the table and disabling read-ahead. This sounds
like it might just be a case of read amplification.
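The amplification is visible directly in the iostat figures quoted below: dividing read throughput by read IOPS gives the average read size.

```python
# Figures from the iostat output quoted in this thread.
reads_per_sec = 6098
read_kb_per_sec = 24392
print(read_kb_per_sec / reads_per_sec)  # -> 4.0 KB per read
```

For context, the table-level compression chunk (chunk_length_in_kb) defaults to 64 KB; a small read has to decompress a whole chunk, which is one source of the amplification mentioned above.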

On Tue., 8 May 2018, 05:43 Kyrylo Lebediev, 
wrote:

> Dear Experts,
>
>
> I'm observing strange behavior on a cluster 2.1.20 during compactions.
>
>
> My setup is:
>
> 12 nodes  m4.2xlarge (8 vCPU, 32G RAM) Ubuntu 16.04, 2T EBS gp2.
>
> Filesystem: XFS, blocksize 4k, device read-ahead - 4k
>
> /sys/block/vxdb/queue/nomerges = 0
>
> SizeTieredCompactionStrategy
>
>
> After data loads when effectively nothing else is talking to the cluster
> and compactions is the only activity, I see something like this:
> $ iostat -dkx 1
> ...
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> xvda  0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> xvdb  0.00 0.00 4769.00  213.00 19076.00 26820.00
> 18.42     7.95    1.17    1.06    3.76   0.20 100.00
>
> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
> avgqu-sz   await r_await w_await  svctm  %util
> xvda  0.00 0.000.000.00 0.00 0.00
> 0.00 0.000.000.000.00   0.00   0.00
> xvdb  0.00 0.00 6098.00  177.00 24392.00 22076.00
> 14.81     6.46    1.36    0.96   15.16   0.16 100.00
>
> Writes are fine: 177 writes/sec <-> ~22Mbytes/sec,
>
> But for some reason compactions generate a huge number of small reads:
> 6098 reads/s <-> ~24Mbytes/sec.  ===>   Read size is 4k
>
>
> Why am I getting a huge number of 4k reads instead of a much smaller
> number of large reads?
>
> What could be the reason?
>
> Thanks,
>
> Kyrill
>
>
>


Re: Version Upgrade

2018-05-03 Thread kurt greaves
>
> In other words, if I am running Cassandra 1.2.x and upgrading to 2.0.x,
> 2.0.x will continue to read all the old Cassandra 1.2.x table. However, if
> I then want to upgrade to Cassandra 2.1.x, I’d better make sure all tables
> have been upgraded to 2.0.x before making the next upgrade.


Correct, but you should really upgrade SSTables as early as possible to
benefit from any storage/performance improvements. Plus it's probably not
incredibly safe to be running an old format with a newer version for an
extended period.
​


Re: Shifting data to DCOS

2018-05-02 Thread kurt greaves
Something is not right if it thinks the rf is different. Do you have the
command you ran for repair and the error?

If you are willing to do the operation again I'd be interested to see if
nodetool cleanup causes any data to be removed (you should snapshot the
disks before running this as it will remove data if tokens were incorrect).

On Wed., 2 May 2018, 21:48 Faraz Mateen, <fmat...@an10.io> wrote:

> Hi all,
>
> Sorry I couldn't update earlier as I got caught up in some other stuff.
>
> Anyway, my previous 3 node cluster was on version 3.9.  I created a new
> cluster of cassandra 3.11.2 with same number of nodes on GCE VMs instead of
> DC/OS. My existing cluster has cassandra data on persistent disks. I made
> copies of those disks and attached them to new cluster.
>
> I was using the following link to move data to the new cluster:
>
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsSnapshotRestoreNewCluster.html
>
> As mentioned in the link, I manually assigned token ranges to each node
> according to their corresponding node in the previous cluster. When I
> restarted cassandra process on the VMs, I noticed that it had automatically
> picked up all my keyspaces and column families. I did not recreate schema
> or copy data manually or run sstablesloader. I am not sure if this should
> have happened.
>
> Anyway, the data in both clusters is still not in sync. I ran a simple
> count query on a table both clusters and got different results:
>
> Old cluster: 217699
> New Cluster: 138770
>
> On the new cluster, when I run nodetool repair for my keyspace, it runs
> fine on one node, but on other two nodes it says that keyspace replication
> factor is 1 so repair is not needed. Cqlsh also shows that the replication
> factor is 2.
>
> Nodetool status on new and old cluster shows different outputs for each
> cluster as well.
>
> *Cluster1:*
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address Load   Tokens   OwnsHost ID
> Rack
> UN  10.128.1.1  228.14 GiB  256  ?
> 63ff8054-934a-4a7a-a33f-405e064bc8e8  rack1
> UN  10.128.1.2  231.25 GiB  256  ?
> 702e8a31-6441--b569-d2d137d54a5d  rack1
> UN  10.128.1.3  199.91 GiB  256  ?
> b5b22a90-f037-433a-8ad9-f370b26cca26  rack1
>
> *Cluster2:*
> Datacenter: datacenter1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address Load   Tokens   OwnsHost ID
> Rack
> UJ  10.142.0.4  211.27 GiB  256  ?
> c55fef77-9c78-449c-b0d9-64e755caee7d  rack1
> UN  10.142.0.2  228.14 GiB  256  ?
> 0065c8e1-47be-4cf8-a3fe-3f4d20ff1b47  rack1
> UJ  10.142.0.3  241.77 GiB  256  ?
> f3b3f409-d108-4751-93ba-682692e46318  rack1
>
> This is weird because both the clusters have essentially same disks
> attached to them.
>  Only one node (10.142.0.2) in cluster2 has the same load as its
> counterpart in the cluster1 (10.128.1.1).
> This is also the node where nodetool repair seems to be running fine and
> it is also acting as the seed node in second cluster.
>
> I am confused about what might be causing this inconsistency in load and
> replication factor. Has anyone ever seen a different replication factor for
> the same keyspace on different nodes? Is there a problem in my workflow?
> Can anyone please suggest the best way to move data from one cluster to
> another?
>
> Any help will be greatly appreciated.
>
> On Tue, Apr 17, 2018 at 6:52 AM, Faraz Mateen <fmat...@an10.io> wrote:
>
>> Thanks for the response guys.
>>
>> Let me try setting token ranges manually and move the data again to
>> correct nodes. Will update with the outcome soon.
>>
>>
>> On Tue, Apr 17, 2018 at 5:42 AM, kurt greaves <k...@instaclustr.com>
>> wrote:
>>
>>> Sorry for the delay.
>>>
>>>> Is the problem related to token ranges? How can I find out token range
>>>> for each node?
>>>> What can I do to further debug and root cause this?
>>>
>>> Very likely. See below.
>>>
>>> My previous cluster has 3 nodes but replication factor is 2. I am not
>>>> exactly sure how I would handle the tokens. Can you explain that a bit?
>>>
>>> The new cluster will have to have the same token ring as the old if you
>>> are copying from node to node. Basically you should get the set of tokens
>>> for each node (from nodetool ring) and when you spin up your 3 new nodes,
>>> set initial_token in the yaml to be the comma-separated list of tokens for 
>>> *exactly
>>> one* node from the previous cluster. When restoring the SSTables you
>>> need to make sure you take the SSTables from the original node and place it
>>> on the new node that has the *same* list of tokens. If you don't do
>>> this it won't be a replica for all the data in those SSTables and
>>> consequently you'll lose data (or it simply won't be available).
>>> ​
>>>
>>
>>
>>
>> --
>> Faraz Mateen
>>
>
>
>
> --
> Faraz Mateen
>


Re: Determining active sstables and table- dir

2018-05-01 Thread kurt greaves
In 2.2 it's cf_id from system.schema_columnfamilies. If it's not then
that's a bug. From 2.2 we stopped including table name in the SSTable name,
so whatever directory contains the SSTables is the active one. Conversely,
if you've dropped a table and re-added it, the directory without any
SSTables is the dropped table, and if you had auto_snapshot enabled it will
have a snapshots directory in there with a snapshot at the time of the drop.
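One way to match a cf_id to a directory is to strip the dashes from the UUID — the directory suffix is the table's cf_id in bare hex (the example UUID below is made up, not from a real cluster):

```python
import uuid

def table_dir_suffix(cf_id):
    # data/<keyspace>/<table>-<cf_id hex, no dashes>/
    return uuid.UUID(cf_id).hex

cf_id = "5a1c395e-b41f-11e5-9f22-ba0be0483c18"  # hypothetical cf_id
print("mytable-" + table_dir_suffix(cf_id))
# -> mytable-5a1c395eb41f11e59f22ba0be0483c18
```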

On 27 April 2018 at 20:24, Carl Mueller 
wrote:

> IN cases where a table was dropped and re-added, there are now two table
> directories with different uuids with sstables.
>
> If you don't have knowledge of which one is active, how do you determine
> which is the active table directory? I have tried cf_id from
> system.schema_columnfamilies and that can work some of the time but have
> seen times cf_id != table-
>
> I have also seen situations where sstables that don't have the
> table/columnfamily are in the table dir and are clearly that active
> sstables (they compacted when I did a nodetool compact)
>
> Is there a way to get a running cassandra node's sstables for a given
> keyspace/table and what table- is active?
>
> This is in a 2.2.x environment that has probably churned a bit from 2.1.x
>


Re: Regular NullPointerExceptions from `nodetool compactionstats` on 3.7 node

2018-04-25 Thread kurt greaves
I've typically seen that in the past when the node is overloaded. Is that a
possibility for you? If it works consistently after restarting C*, that's
likely the issue.

On 20 April 2018 at 19:27, Paul Pollack  wrote:

> Hi all,
>
> We have a cluster running on Cassandra 3.7 (we already know this is
> considered a "bad" version and plan to upgrade to 3.11 in the
> not-too-distant future) and we have a few Nagios checks that run `nodetool
> compactionstats` to check how many pending compactions there currently are,
> as well as bytes remaining for compactions to see if they will push us past
> our comfortable disk utilization threshold.
>
> The check regularly fails with an exit code of 2, and then shortly after
> will run successfully, resulting in a check that flaps.
>
> When I am able to reproduce the issue, the output looks like this:
>
> ubuntu@statistic-timelines-11:~$ nodetool compactionstats
> error: null
> -- StackTrace --
> java.lang.NullPointerException
>
> ubuntu@statistic-timelines-11:~$ echo $?
> 2
>
> I've seen this issue
>  for 3.0.11 that
> was fixed and seems slightly different since in this case, something is
> swallowing the full stack trace.
>
> So given all this I have a few questions:
> - Has anyone seen this before and have an idea as to what might cause it?
> - Is it possible that I have something misconfigured that's swallowing the
> stack trace?
> - Should I file an issue in the Cassandra JIRA for this?
>
> Thanks,
> Paul
>


Re: Memtable type and size allocation

2018-04-23 Thread kurt greaves
Hi Vishal,

In Cassandra 3.11.2, there are 3 choices for the type of Memtable
> allocation and as per my understanding, if I want to keep Memtables on JVM
> heap I can use heap_buffers and if I want to store Memtables outside of JVM
> heap then I've got 2 options offheap_buffers and offheap_objects.

Heap buffers == everything is allocated on heap, e.g. the entire row and its
contents.
Offheap_buffers is partially on heap, partially offheap. It moves the cell
name + value to offheap buffers. Not sure how much this has changed in 3.x.
Offheap_objects moves entire cells offheap and we only keep a reference to
them on heap.

Also, the permitted memory space to be used for Memtables can be set at 2
> places in the YAML file, i.e. memtable_heap_space_in_mb and
> memtable_offheap_space_in_mb.

 Do I need to configure some space in both heap and offheap, irrespective
> of the Memtable allocation type or do I need to set only one of them based
> on my Memtable allocation type i.e. memtable_heap_space_in_mb when using
> heap buffers and memtable_offheap_space_in_mb only when using either of the
> other 2 offheap options?


Both are still relevant and used if using offheap. If you're not using an
offheap option, only memtable_heap_space_in_mb is relevant. For the most part,
the defaults (1/4 of heap size) should be sufficient.


Re: SSTable count in Nodetool tablestats(LevelCompactionStrategy)

2018-04-20 Thread kurt greaves
I'm currently investigating this issue on one of our clusters (but much
worse, we're seeing >100 SSTables and only 2 in the levels) on 3.11.1. What
version are you using? It's definitely a bug.

On 17 April 2018 at 10:09,  wrote:

> Dear Community,
>
>
>
> One of the tables in my keyspace is using LevelCompactionStrategy and when
> I used the nodetool tablestats keyspace.table_name command, I found some
> mismatch in the count of SSTables displayed at 2 different places. Please
> refer the attached image.
>
>
>
> The command is giving SSTable count = 6 but if you add the numbers shown
> against SSTables in each level, then that comes out as 5. Why is there a
> difference?
>
>
>
> Thanks and regards,
>
> Vishal Sharma
>
>
> "*Confidentiality Warning*: This message and any attachments are intended
> only for the use of the intended recipient(s), are confidential and may be
> privileged. If you are not the intended recipient, you are hereby notified
> that any review, re-transmission, conversion to hard copy, copying,
> circulation or other use of this message and any attachments is strictly
> prohibited. If you are not the intended recipient, please notify the sender
> immediately by return email and delete this message and any attachments
> from your system.
>
> *Virus Warning:* Although the company has taken reasonable precautions to
> ensure no viruses are present in this email. The company cannot accept
> responsibility for any loss or damage arising from the use of this email or
> attachment."
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>


Re: Phantom growth resulting automatically node shutdown

2018-04-19 Thread kurt greaves
This was fixed (again) in 3.0.15.
https://issues.apache.org/jira/browse/CASSANDRA-13738

On Fri., 20 Apr. 2018, 00:53 Jeff Jirsa,  wrote:

> There have also been a few sstable ref counting bugs that would over
> report load in nodetool ring/status due to overlapping normal and
> incremental repairs (which you should probably avoid doing anyway)
>
> --
> Jeff Jirsa
>
>
> On Apr 19, 2018, at 9:27 AM, Rahul Singh 
> wrote:
>
> I’ve seen something similar in 2.1. Our issue was related to file
> permissions being flipped due to an automation and C* stopped seeing
> Sstables so it started making new data — via read repair or repair
> processes.
>
> In your case, if nodetool is reporting data, that means it's growing
> due to data growth. What does your cfstats / tablestats say? Are you
> monitoring your key tables' data via cfstats metrics like SpaceUsedLive or
> SpaceUsedTotal? What is your snapshotting / backup process doing?
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 19, 2018, 7:01 AM -0500, horschi , wrote:
>
> Did you check the number of files in your data folder before & after the
> restart?
>
> I have seen cases where cassandra would keep creating sstables, which
> disappeared on restart.
>
> regards,
> Christian
>
>
> On Thu, Apr 19, 2018 at 12:18 PM, Fernando Neves  wrote:
>
>> I am facing one issue with our Cassandra cluster.
>>
>> Details: Cassandra 3.0.14, 12 nodes, 7.4TB(JBOD) disk size in each node,
>> ~3.5TB used physical data in each node, ~42TB whole cluster and default
>> compaction setup. This size maintain the same because after the retention
>> period some tables are dropped.
>>
>> Issue: Nodetool status is not showing the correct used size in the
>> output. It keeps increasing the used size without limit until automatically
>> node shutdown or until our sequential scheduled restart(workaround 3 times
>> week). After the restart, nodetool shows the correct used space but for few
>> days.
>> Did anybody have similar problem? Is it a bug?
>>
>> Stackoverflow:
>> https://stackoverflow.com/questions/49668692/cassandra-nodetool-status-is-not-showing-correct-used-space
>>
>>
>


Re: Token range redistribution

2018-04-19 Thread kurt greaves
That's assuming your data is perfectly consistent, which is unlikely.
Typically that strategy is a bad idea and you should avoid it.

On Thu., 19 Apr. 2018, 07:00 Richard Gray, <richard.g...@smxemail.com>
wrote:

> On 2018-04-18 21:28, kurt greaves wrote:
> > replacing. Simply removing and adding back a new node without replace
> > address will end up with the new node having different tokens, which
> > would mean data loss in the use case you described.
>
> If you have replication factor N > 1, you haven't necessarily lost data
> unless you've swapped out N or more nodes (without using
> replace_address). If you've swapped out fewer than N nodes, you should
> still be able to restore consistency by running a repair.
>
> --
> Richard Gray
>
> _
>
> This email has been filtered by SMX. For more info visit
> http://smxemail.com
>
> _
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Token range redistribution

2018-04-18 Thread kurt greaves
A new node always generates new tokens of its own. A replaced node using
replace_address[_on_first_boot] will reclaim the tokens of the node it's
replacing. Simply removing and adding back a new node without replace
address will end up with the new node having different tokens, which would
mean data loss in the use case you described.
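A toy single-token-per-node ring shows why different tokens mean lost data in that scenario (purely illustrative; real nodes default to 256 vnodes each):

```python
from bisect import bisect_left

def owner(token, ring):
    # ring: sorted (token, node) pairs; a key's owner is the node with the
    # first token >= the key's token, wrapping around the ring.
    tokens = [t for t, _ in ring]
    return ring[bisect_left(tokens, token) % len(ring)][1]

ring = [(100, "A"), (200, "B"), (300, "C")]
print(owner(150, ring))  # -> 'B'

# B is removed and re-added without replace_address, generating a new token:
ring2 = [(100, "A"), (120, "B"), (300, "C")]
print(owner(150, ring2))  # -> 'C', which holds no data for that range
```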

On Wed., 18 Apr. 2018, 16:51 Akshit Jain,  wrote:

> Hi,
> If I replace a node, does it redistribute the token range, or will the node
> be allocated a new token range when it joins again?
>
> Use case:
> I have booted a C* cluster on AWS. I terminated a node, then booted a new
> node, assigned it the same IP, and made it join the cluster.
>
> In this case, would the token range be redistributed, and will the node get
> a new token range?
> Would the process be different for seed nodes?
>
> Regards
> Akshit Jain
>


Re: about the tombstone and hinted handoff

2018-04-16 Thread kurt greaves
I don't think that's true/maybe that comment is misleading. Tombstones
AFAIK will be propagated by hints, and the hint system doesn't do anything
to check if a particular row has been tombstoned. To the node receiving the
hints it just looks like it's receiving a bunch of writes; it doesn't know
they are hints.

On 12 April 2018 at 13:51, Jinhua Luo  wrote:

> Hi All,
>
> In the doc:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dml
> AboutDeletes.html
>
> It said "When an unresponsive node recovers, Cassandra uses hinted
> handoff to replay the database mutations the node missed while it was
> down. Cassandra does not replay a mutation for a tombstoned record
> during its grace period.".
>
> The tombstone here is on the recovered node or coordinator?
> The tombstone is a special write record, so it must have writetime.
> We could compare the writetime between the version in the hint and the
> version of the tombstone, which is enough to make the choice, so why do we
> need to wait for gc_grace_seconds here?
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Shifting data to DCOS

2018-04-16 Thread kurt greaves
Sorry for the delay.

> Is the problem related to token ranges? How can I find out token range for
> each node?
> What can I do to further debug and root cause this?

Very likely. See below.

My previous cluster has 3 nodes but replication factor is 2. I am not
> exactly sure how I would handle the tokens. Can you explain that a bit?

The new cluster will have to have the same token ring as the old if you are
copying from node to node. Basically you should get the set of tokens for
each node (from nodetool ring) and when you spin up your 3 new nodes, set
initial_token in the yaml to be the comma-separated list of tokens for
*exactly one* node from the previous cluster. When restoring the SSTables you need
to make sure you take the SSTables from the original node and place it on
the new node that has the *same* list of tokens. If you don't do this it
won't be a replica for all the data in those SSTables and consequently
you'll lose data (or it simply won't be available).
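The mapping above can be sketched as a small helper: collect each old node's tokens (as reported by nodetool ring) and render the initial_token line for exactly one new node per old node. The node names and token values below are made up:

```python
# Hypothetical token lists per old node, as collected from `nodetool ring`.
old_cluster = {
    "old-node-1": [-9000000000000000000, -3000000000000000000, 3000000000000000000],
    "old-node-2": [-8000000000000000000, -2000000000000000000, 4000000000000000000],
    "old-node-3": [-7000000000000000000, -1000000000000000000, 5000000000000000000],
}

# One-to-one mapping: each new node takes *exactly one* old node's tokens,
# and must also receive that same old node's SSTables.
node_map = {"old-node-1": "new-node-1",
            "old-node-2": "new-node-2",
            "old-node-3": "new-node-3"}

def initial_token_lines(old_cluster, node_map):
    """Render the cassandra.yaml initial_token value for each new node."""
    return {
        node_map[old]: "initial_token: " + ",".join(str(t) for t in sorted(tokens))
        for old, tokens in old_cluster.items()
    }

lines = initial_token_lines(old_cluster, node_map)
print(lines["new-node-1"])
```

If the SSTables end up on a node with a different token list, that node is not a replica for all of their data, which is exactly the failure mode described above.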


Re: Many SSTables only on one node

2018-04-09 Thread kurt greaves
If there were no other messages about anti-compaction similar to:
>
> SSTable YYY (ranges) will be anticompacted on range [range]


Then no anti-compaction needed to occur and yes, it was not the cause.

On 5 April 2018 at 13:52, Dmitry Simonov  wrote:

> Hi, Evelyn!
>
> I've found the following messages:
>
> INFO RepairRunnable.java Starting repair command #41, repairing keyspace
> XXX with repair options (parallelism: parallel, primary range: false,
> incremental: false, job threads: 1, ColumnFamilies: [YYY], dataCenters: [],
> hosts: [], # of ranges: 768)
> INFO CompactionExecutor:6 CompactionManager.java Starting anticompaction
> for XXX.YYY on 5132/5846 sstables
>
> After that many similar messages go:
> SSTable BigTableReader(path='/mnt/cassandra/data/XXX/YYY-
> 4c12fd9029e611e8810ac73ddacb37d1/lb-12688-big-Data.db') fully contained
> in range (-9223372036854775808,-9223372036854775808], mutating repairedAt
> instead of anticompacting
>
> Does it mean that anti-compaction is not the cause?
>
> 2018-04-05 18:01 GMT+05:00 Evelyn Smith :
>
>> It might not be what cause it here. But check your logs for
>> anti-compactions.
>>
>>
>> On 5 Apr 2018, at 8:35 pm, Dmitry Simonov  wrote:
>>
>> Thank you!
>> I'll check this out.
>>
>> 2018-04-05 15:00 GMT+05:00 Alexander Dejanovski :
>>
>>> 40 pending compactions is pretty high and you should have way less than
>>> that most of the time, otherwise it means that compaction is not keeping up
>>> with your write rate.
>>>
>>> If you indeed have SSDs for data storage, increase your compaction
>>> throughput to 100 or 200 (depending on how the CPUs handle the load). You
>>> can experiment with compaction throughput using : nodetool
>>> setcompactionthroughput 100
>>>
>>> You can raise the number of concurrent compactors as well and set it to
>>> a value between 4 and 6 if you have at least 8 cores and CPUs aren't
>>> overwhelmed.
>>>
>>> I'm not sure why you ended up with only one node having 6k SSTables and
>>> not the others, but you should apply the above changes so that you can
>>> lower the number of pending compactions and see if it prevents the issue
>>> from happening again.
>>>
>>> Cheers,
>>>
>>>
>>> On Thu, Apr 5, 2018 at 11:33 AM Dmitry Simonov 
>>> wrote:
>>>
 Hi, Alexander!

 SizeTieredCompactionStrategy is used for all CFs in problematic
 keyspace.
 Current compaction throughput is 16 MB/s (default value).

 We always have about 40 pending and 2 active "CompactionExecutor" tasks
 in "tpstats".
 Mostly because of another (bigger) keyspace in this cluster.
 But the situation is the same on each node.

 According to "nodetool compactionhistory", compactions on this CF run
 (sometimes several times per day, sometimes one time per day, the last run
 was yesterday).
 We run "repair -full" regulary for this keyspace (every 24 hours on
 each node), because gc_grace_seconds is set to 24 hours.

 Should we consider increasing compaction throughput and
 "concurrent_compactors" (as recommended for SSDs) to keep
 "CompactionExecutor" pending tasks low?

 2018-04-05 14:09 GMT+05:00 Alexander Dejanovski :

> Hi Dmitry,
>
> could you tell us which compaction strategy that table is currently
> using ?
> Also, what is the compaction max throughput and is auto-compaction
> correctly enabled on that node ?
>
> Did you recently run repair ?
>
> Thanks,
>
> On Thu, Apr 5, 2018 at 10:53 AM Dmitry Simonov 
> wrote:
>
>> Hello!
>>
>> Could you please give some ideas on the following problem?
>>
>> We have a cluster with 3 nodes, running Cassandra 2.2.11.
>>
>> We've recently discovered high CPU usage on one cluster node, after
>> some investigation we found that number of sstables for one CF on it is
>> very big: 5800 sstables; on other nodes, 3 sstables.
>>
>> Data size in this keyspace was not very big ~100-200Mb per node.
>>
>> There is no such problem with other CFs of that keyspace.
>>
>> nodetool compact solved the issue as a quick-fix.
>>
>> But I'm wondering, what was the cause? How can I prevent it from repeating?
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>



 --
 Best Regards,
 Dmitry Simonov

>>> --
>>> -
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
>>
>>
>
>
> --
> Best Regards,
> Dmitry 

Re: Shifting data to DCOS

2018-04-06 Thread kurt greaves
Without looking at the code I'd say maybe the keyspaces are displayed
purely because the directories exist (but it seems unlikely). The process
you should follow instead is to exclude the system keyspaces for each node
and manually apply your schema, then upload your CFs into the correct
directory. Note this only works when RF=#nodes; if you have more nodes you
need to take tokens into account when restoring.

On Fri., 6 Apr. 2018, 17:16 Affan Syed,  wrote:

> Michael,
>
> both of the folders are with hashes, so I don't think that would be an issue.
>
> What is strange is why the tables don't show up if the keyspaces are
> visible. Shouldn't that be metadata that can be edited once and then be
> visible?
>
> Affan
>
> - Affan
>
> On Thu, Apr 5, 2018 at 7:55 PM, Michael Shuler 
> wrote:
>
>> On 04/05/2018 09:04 AM, Faraz Mateen wrote:
>> >
>> > For example,  if the table is *data_main_bim_dn_10*, its data directory
>> > is named data_main_bim_dn_10-a73202c02bf311e8b5106b13f463f8b9. I created
>> > a new table with the same name through cqlsh. This resulted in creation
>> > of another directory with a different hash i.e.
>> > data_main_bim_dn_10-c146e8d038c611e8b48cb7bc120612c9. I copied all data
>> > from the former to the latter.
>> >
>> > Then I ran *"nodetool refresh ks1  data_main_bim_dn_10"*. After that I
>> > was able to access all data contents through cqlsh.
>> >
>> > Now, the problem is, I have around 500 tables and the method I mentioned
>> > above is quite cumbersome. Bulkloading through sstableloader or remote
>> > seeding are also a couple of options but they will take a lot of time.
>> > Does anyone know an easier way to shift all my data to new setup on
>> DC/OS?
>>
>> For upgrade support from older versions of C* that did not have the hash
>> on the data directory, the table data dir can be just
>> `data_main_bim_dn_10` without the appended hash, as in your example.
>>
>> Give that a quick test to see if that simplifies things for you.
>>
>> --
>> Kind regards,
>> Michael
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>


Re: auto_bootstrap for seed node

2018-04-03 Thread kurt greaves
Setting auto_bootstrap on seed nodes is unnecessary and irrelevant. If the
node is a seed it will ignore auto_bootstrap and it *will not* bootstrap.

On 28 March 2018 at 15:49, Ali Hubail  wrote:

> "it seems that we still need to keep bootstrap false?"
>
> Could you shed some light on what would happen if the auto_bootstrap is
> removed (or set to true as the default value) in the seed nodes of the
> newly added DC?
>
> What do you have in the seeds param of the new DC nodes (cassandra.yaml)?
> Do you reference the old DC seed nodes there as well?
>
> *Ali Hubail*
>
> Email: ali.hub...@petrolink.com | www.petrolink.com
> Confidentiality warning: This message and any attachments are intended
> only for the persons to whom this message is addressed, are confidential,
> and may be privileged. If you are not the intended recipient, you are
> hereby notified that any review, retransmission, conversion to hard copy,
> copying, modification, circulation or other use of this message and any
> attachments is strictly prohibited. If you receive this message in error,
> please notify the sender immediately by return email, and delete this
> message and any attachments from your system. Petrolink International
> Limited its subsidiaries, holding companies and affiliates disclaims all
> responsibility from and accepts no liability whatsoever for the
> consequences of any unauthorized person acting, or refraining from acting,
> on any information contained in this message. For security purposes, staff
> training, to assist in resolving complaints and to improve our customer
> service, email communications may be monitored and telephone calls may be
> recorded.
>
>
> *"Peng Xiao" <2535...@qq.com <2535...@qq.com>>*
>
> 03/28/2018 12:54 AM
> Please respond to
> user@cassandra.apache.org
>
> To
> "user" ,
>
> cc
> Subject
> Re:  auto_bootstrap for seed node
>
>
>
>
> We followed this https://docs.datastax.com/en/cassandra/2.1/cassandra/
> operations/ops_add_dc_to_cluster_t.html,
> but it does not mention changing auto_bootstrap for seed nodes after the
> rebuild.
>
> Thanks,
> Peng Xiao
>
>
> -- Original --
> *From: * "Ali Hubail";
> *Date: * Wed, Mar 28, 2018 10:48 AM
> *To: * "user";
> *Subject: * Re: auto_bootstrap for seed node
>
> You might want to follow DataStax docs on this one:
>
> For adding a DC to an existing cluster:
> *https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsAddDCToCluster.html*
> 
> For adding a new node to an existing cluster:
> *https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsAddNodeToCluster.html*
> 
>
> briefly speaking,
> adding one node to an existing cluster --> use auto_bootstrap
> adding a DC to an existing cluster --> rebuild
>
> You need to check the version of c* that you're running, and make sure you
> pick the right doc version for that.
>
> Most of my colleagues miss very important steps while adding/removing
> nodes/cluster, but if they stick to the docs, they always get it done right.
>
> Hope this helps
>
> * Ali Hubail*
>
>
> *"Peng Xiao" <2535...@qq.com <2535...@qq.com>>*
>
> 03/27/2018 09:39 PM
>
>
> Please respond to
> user@cassandra.apache.org
>
> To
> "user" ,
>
> cc
> Subject
> auto_bootstrap for seed node
>
>
>
>
>
>
> Dear All,
>
> For adding a new DC, we need to set auto_bootstrap: false and then run the
> rebuild; finally we need to change auto_bootstrap: true. But for seed
> nodes, it seems that we still need to keep auto_bootstrap false?
> Could anyone please confirm?
>
> Thanks,
> Peng Xiao
>


Re: Execute an external program

2018-04-03 Thread kurt greaves
Correct. Note that both triggers and CDC aren't widely used yet so be sure
to test.

On 28 March 2018 at 13:02, Earl Lapus  wrote:

>
> On Wed, Mar 28, 2018 at 8:39 AM, Jeff Jirsa  wrote:
>
>> CDC may also work for newer versions, but it’ll happen after the mutation
>> is applied
>>
>> --
>> Jeff Jirsa
>>
>>
> "after the mutation is applied" means after the query is executed?
>
>


Re: replace dead node vs remove node

2018-03-25 Thread kurt greaves
Didn't read the blog but it's worth noting that if you replace the node and
give it a *different* ip address repairs will not be necessary as it will
receive writes during replacement. This works as long as you start up the
replacement node before HH window ends.

https://issues.apache.org/jira/browse/CASSANDRA-12344 and
https://issues.apache.org/jira/browse/CASSANDRA-11559 fixes this for same
address replacements (hopefully in 4.0)
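A tiny decision helper for the hinted-handoff point above, assuming the default max_hint_window_in_ms of three hours:

```python
MAX_HINT_WINDOW_IN_MS = 3 * 60 * 60 * 1000  # cassandra.yaml default: 3 hours

def repair_needed(downtime_seconds, hint_window_ms=MAX_HINT_WINDOW_IN_MS):
    """Hints only cover writes while the window is open; beyond it, repair."""
    return downtime_seconds * 1000 > hint_window_ms

assert not repair_needed(2 * 60 * 60)  # down 2h: hints can still be replayed
assert repair_needed(4 * 60 * 60)      # down 4h: writes were missed, repair
```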

On Fri., 23 Mar. 2018, 15:11 Anthony Grasso, 
wrote:

> Hi Peng,
>
> Correct, you would want to repair in either case.
>
> Regards,
> Anthony
>
>
> On Fri, 23 Mar 2018 at 14:09, Peng Xiao <2535...@qq.com> wrote:
>
>> Hi Anthony,
>>
>> there is a problem with replacing a dead node as per the blog: if the
>> replacement process takes longer than max_hint_window_in_ms, we must run
>> repair to make the replaced node consistent again, since it missed ongoing
>> writes during bootstrapping. But for a large cluster, repair is a painful
>> process.
>>
>> Thanks,
>> Peng Xiao
>>
>>
>>
>> -- Original Message --
>> *From:* "Anthony Grasso";
>> *Sent:* Thursday, March 22, 2018, 7:13 PM
>> *To:* "user";
>> *Subject:* Re: replace dead node vs remove node
>>
>> Hi Peng,
>>
>> Depending on the hardware failure you can do one of two things:
>>
>> 1. If the disks are intact and uncorrupted you could just use the disks
>> with the current data on them in the new node. Even if the IP address
>> changes for the new node that is fine. In that case all you need to do is
>> run repair on the new node. The repair will fix any writes the node missed
>> while it was down. This process is similar to the scenario in this blog
>> post:
>> http://thelastpickle.com/blog/2018/02/21/replace-node-without-bootstrapping.html
>>
>> 2. If the disks are inaccessible or corrupted, then use the method as
>> described in the blogpost you linked to. The operation is similar to
>> bootstrapping a new node. There is no need to perform any other remove or
>> join operation on the failed or new nodes. As per the blog post, you
>> definitely want to run repair on the new node as soon as it joins the
>> cluster. In this case here, the data on the failed node is effectively lost
>> and replaced with data from other nodes in the cluster.
>>
>> Hope this helps.
>>
>> Regards,
>> Anthony
>>
>>
>> On Thu, 22 Mar 2018 at 20:52, Peng Xiao <2535...@qq.com> wrote:
>>
>>> Dear All,
>>>
>>> when one node fails with hardware errors, it will be in DN status in
>>> the cluster. Then if we are not able to handle this error within three
>>> hours (max hints window), we will lose data, right? We have to run
>>> repair to keep consistency.
>>> And as per
>>> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html,
>>> we can replace this dead node. Is it the same as bootstrapping a new
>>> node? That means we don't need to remove the node and rejoin?
>>> Could anyone please advise?
>>>
>>> Thanks,
>>> Peng Xiao
>>>
>>>
>>>
>>>
>>>


Re: Nodetool Repair --full

2018-03-18 Thread kurt greaves
Worth noting that if you have racks == RF you only need to repair one rack
to repair all the data in the cluster if you *don't* use -pr. Also note
that full repairs on >=3.0 cause anti-compactions and will mark things as
repaired, so once you start repairs you need to keep repairing to ensure
you don't have any zombie data or other problems.
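The repair-efficiency arithmetic discussed in this thread can be modelled with a toy function (assuming repair is run on every node of the cluster):

```python
def repairs_per_range(num_nodes, rf, primary_range_only):
    """How many times each token range gets repaired when every node runs repair.

    With -pr each node repairs only its primary range, so the ring is covered
    exactly once. Without -pr each node repairs all rf replicas of its data,
    so every range is repaired rf times.
    """
    ranges_repaired = num_nodes * (1 if primary_range_only else rf)
    return ranges_repaired / num_nodes

assert repairs_per_range(6, 3, primary_range_only=True) == 1   # -full -pr on all nodes
assert repairs_per_range(6, 3, primary_range_only=False) == 3  # -full on all nodes: RF times
```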

On 17 March 2018 at 15:52, Hannu Kröger  wrote:

> Hi Jonathan,
>
> If you want to repair just one node (for example if it has been down for
> more than 3h), run “nodetool repair -full” on that node. This will bring
> all data on that node up to date.
>
> If you want to repair all data on the cluster, run “nodetool repair -full
> -pr” on each node. This will run full repair on all nodes but it will do it
> so only the primary range for each node is fixed. If you do it on all
> nodes, effectively the whole token range is repaired. You can run the same
> without -pr to get the same effect but it’s not efficient because then you
> are doing the repair RF times on all data instead of just repairing the
> whole data once.
>
> I hope this clarifies,
> Hannu
>
> On 17 Mar 2018, at 17:20, Jonathan Baynes 
> wrote:
>
> Hi Community,
>
> Can someone confirm, as the documentation out on the web is so
> contradictory and vague.
>
> Nodetool repair –full if I call this, do I need to run this on ALL my
> nodes or is just the once sufficient?
>
> Thanks
> J
>
> *Jonathan Baynes*
> DBA
> Tradeweb Europe Limited
> Moor Place  •  1 Fore Street Avenue  •  London EC2Y 9DT
> P +44 (0)20 77760988  •  F +44 (0)20 7776 3201  •  M +44 (0)7884111546
> jonathan.bay...@tradeweb.com
>
> —
> A leading marketplace for
> electronic fixed income, derivatives and ETF trading
>
>
> 
>
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and destroy it. Any unauthorized
> copying, disclosure or distribution of the material in this e-mail is
> strictly forbidden. Tradeweb reserves the right to monitor all e-mail
> communications through its networks. If you do not wish to receive
> marketing emails about our products / services, please let us know by
> contacting us, either by email at contac...@tradeweb.com or by writing to
> us at the registered office of Tradeweb in the UK, which is: Tradeweb
> Europe Limited (company number 3912826), 1 Fore Street Avenue London EC2Y
> 9DT
> .
> To see our privacy policy, visit our website @ www.tradeweb.com.
>
>
>


Re: Best way to Drop Tombstones/after GC Grace

2018-03-14 Thread kurt greaves
At least set GCGS == max_hint_window_in_ms that way you don't effectively
disable hints for the table while your compaction is running. Might be
preferable to use nodetool garbagecollect if you don't have enough disk
space for a major compaction. Also worth noting you should do a splitting
major compaction so you don't end up with one big SSTable when using STCS
(also applicable to LCS).
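One detail worth spelling out for "GCGS == max_hint_window_in_ms": the two settings use different units, so the assignment needs a conversion (sketch):

```python
max_hint_window_in_ms = 3 * 60 * 60 * 1000  # cassandra.yaml default: 3 hours

# Keep gc_grace_seconds at least as long as the hint window, so hints for
# the table aren't effectively disabled while tombstones are being purged.
gc_grace_seconds = max_hint_window_in_ms // 1000

assert gc_grace_seconds == 10800  # 3 hours, expressed in seconds
```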

On 14 March 2018 at 18:53, Jeff Jirsa  wrote:

> Can’t advise that without knowing the risk to your app if there’s data
> resurrected
>
>
> If there’s no risk, then sure - set gcgs to 0 and force / major compact if
> you have the room
>
>
>
> --
> Jeff Jirsa
>
>
> On Mar 14, 2018, at 11:47 AM, Madhu-Nosql  wrote:
>
> Jeff,
>
> Thank you, I got this. How about dropping the existing tombstones right
> now: would setting gc_grace_seconds to zero at the table level be good,
> or what would you suggest?
>
> On Wed, Mar 14, 2018 at 1:41 PM, Jeff Jirsa  wrote:
>
>> What version of Cassandra?
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-7304 sort of addresses
>> this in 2.2+
>>
>>
>>
>>
>> On Wed, Mar 14, 2018 at 11:32 AM, Madhu-Nosql 
>> wrote:
>>
>>> Rahul,
>>>
>>> The tombstones are caused on the application driver side. Even though the
>>> application does not use some of the columns in its logic, the driver
>>> still binds them: when updating one column, the driver automatically
>>> fills the rest of the columns with nulls, and behind the scenes Cassandra
>>> treats those nulls as tombstones.
>>>
>>> On Wed, Mar 14, 2018 at 12:58 PM, Rahul Singh <
>>> rahul.xavier.si...@gmail.com> wrote:
>>>
 Then don’t write nulls. That’s the root of the issue. Sometimes they
 surface from prepared statements. Other times they come because of default
 null values in objects.

 --
 Rahul Singh
 rahul.si...@anant.us

 Anant Corporation

 On Mar 13, 2018, 2:18 PM -0400, Madhu-Nosql ,
 wrote:

 We assume that's because of nulls

 On Tue, Mar 13, 2018 at 12:58 PM, Rahul Singh <
 rahul.xavier.si...@gmail.com> wrote:

> Are you writing nulls or does the data cycle that way?
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Mar 13, 2018, 11:48 AM -0400, Madhu-Nosql ,
> wrote:
>
> Rahul,
>
> Nodetool scrub is good for rescue, but what if it's happening all the time?
>
> On Tue, Mar 13, 2018 at 10:37 AM, Rahul Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> Do you anticipate this happening all the time or are you just trying
>> to rescue?
>>
>> Nodetool scrub can be useful too.
>>
>>
>> --
>> Rahul Singh
>> rahul.si...@anant.us
>>
>> Anant Corporation
>>
>> On Mar 13, 2018, 11:29 AM -0400, Madhu-Nosql ,
>> wrote:
>>
>> I have a few ways to drop tombstones (Chaos Monkey / zombie data), mainly
>> to avoid data resurrection (data you deleted comes back in the
>> future)
>>
>> I am thinking of below options, let me know if you have any best
>> practice for this
>>
>> 1.using nodetool garbagecollect
>> 2.only_purge_repaired_tombstones
>> 3.At Table level making GC_Grace_period to zero and compact
>>
>> Thanks,
>> Madhu
>>
>>
>

>>>
>>
>


Re: What versions should the documentation support now?

2018-03-13 Thread kurt greaves
>
> I’ve never heard of anyone shipping docs for multiple versions, I don’t
> know why we’d do that.  You can get the docs for any version you need by
> downloading C*, the docs are included.  I’m a firm -1 on changing that
> process.

We should still host versioned docs on the website however. Either that or
we specify "since version x" for each component in the docs with notes on
behaviour.


Re: Removing initial_token parameter

2018-03-09 Thread kurt greaves
Correct. Tokens will be stored in the node's system tables after the first
boot, so feel free to remove them (although it's not really necessary)

On 9 Mar. 2018 20:16, "Mikhail Tsaplin"  wrote:

> Is it safe to remove initial_token parameter on a cluster created by
> snapshot restore procedure presented here https://docs.datastax.com
> /en/cassandra/latest/cassandra/operations/opsSnapshotRestore
> NewCluster.html  ?
>
> To me, it seems that the initial_token parameter is used only when nodes
> are started for the first time; on later reboots Cassandra obtains tokens
> from its internal structures, so the absence of initial_token would not
> affect it.
>
>


Re: Cassandra/Spark failing to process large table

2018-03-08 Thread kurt greaves
Note that read repairs only occur for QUORUM/equivalent and higher, and
also with a 10% (default) chance on anything less than QUORUM
(ONE/LOCAL_ONE). This is configured at the table level through the
dclocal_read_repair_chance and read_repair_chance settings (which are going
away in 4.0). So if you read at LOCAL_ONE it would have been chance that
caused the read repair. Don't expect it to happen for every read (unless
you configure it to, or use >=QUORUM).
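A quick simulation of the chance-based behaviour, assuming the default dclocal_read_repair_chance of 0.1:

```python
import random

def simulate_reads(n_reads, dclocal_read_repair_chance=0.1, seed=1):
    """Count how many sub-QUORUM reads would trigger a chance read repair."""
    rng = random.Random(seed)
    # At CL below QUORUM, a read repair fires only when the dice roll says so.
    return sum(1 for _ in range(n_reads)
               if rng.random() < dclocal_read_repair_chance)

repairs = simulate_reads(100_000)
assert 9_000 < repairs < 11_000  # roughly 10% of reads trigger a read repair
```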


Re: One time major deletion/purge vs periodic deletion

2018-03-07 Thread kurt greaves
The important point to consider is whether you are deleting old data or
recently written data. How old/recent depends on your write rate to the
cluster and there's no real formula. Basically you want to avoid deleting a
lot of old data all at once because the tombstones will end up in new
SSTables and the data to be deleted will live in higher levels (LCS) or
large SSTables (STCS), which won't get compacted together for a long time.
In this case it makes no difference if you do a big purge or if you break
it up, because at the end of the day if your big purge is just old data,
all the tombstones will have to stick around for awhile until they make it
to the higher levels/bigger SSTables.

If you have to purge large amounts of old data, the easiest way is to 1.
Make sure you have at least 50% disk free (for large/major compactions)
and/or 2. Use garbagecollect compactions (3.10+)
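The purge rule behind all of this can be sketched as the check a compaction must make before dropping a tombstone. This is a simplified model of the rules, not Cassandra's actual code:

```python
def tombstone_droppable(tombstone_ts, deletion_time, now, gc_grace_seconds,
                        overlapping_min_timestamps):
    """overlapping_min_timestamps: the minTimestamp of each SSTable that holds
    the same partition but is NOT part of this compaction."""
    if now - deletion_time < gc_grace_seconds:
        return False  # still within the grace period
    # Safe only if every overlapping SSTable holds strictly newer data,
    # so nothing the tombstone shadows could resurrect.
    return all(min_ts > tombstone_ts for min_ts in overlapping_min_timestamps)

# Old delete, but an old SSTable outside the compaction may still hold
# shadowed data: the tombstone must be kept.
assert not tombstone_droppable(100, 0, 10**6, 1000,
                               overlapping_min_timestamps=[50])
# Same delete with only newer SSTables overlapping: droppable.
assert tombstone_droppable(100, 0, 10**6, 1000,
                           overlapping_min_timestamps=[200])
```

This is why tombstones written into new SSTables over old data can linger: until compaction brings the tombstone and the shadowed data together, the check above keeps failing.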


Re: Cassandra 2.1.18 - Concurrent nodetool repair resulting in > 30K SSTables for a single small (GBytes) CF

2018-03-06 Thread kurt greaves
>
>  What we did have was some sort of overlapping between our daily repair
> cronjob and the newly added node still in progress joining. Don’t know if
> this sort of combination might causing troubles.

I wouldn't be surprised if this caused problems. Probably want to avoid
that.

with waiting a few minutes after each finished execution and every time I
> see “… out of sync …” log messages in context of the repair, so it looks
> like each repair execution is detecting inconsistencies. Does this
> make sense?

Well it doesn't, but there have been issues in the past that caused exactly
this problem. I was under the impression they were all fixed by 2.1.18
though.

Additionally, we are writing at CL ANY, reading at ONE and repair chance
> for the 2 CFs in question is default 0.1

Have you considered writing at least at CL [LOCAL_]ONE? At the very least
it would rule out if there's a problem with hints.


Re: Cassandra 2.1.18 - Concurrent nodetool repair resulting in > 30K SSTables for a single small (GBytes) CF

2018-03-04 Thread kurt greaves
Repairs with vnodes is likely to cause a lot of small SSTables if you have
inconsistencies (at least 1 per vnode). Did you have any issues when adding
nodes, or did you add multiple nodes at a time? Anything that could have
led to a bit of inconsistency could have been the cause.

I'd probably avoid running the repairs across all the nodes simultaneously
and instead spread them out over a week. That likely made it worse. Also
worth noting that in versions 3.0+ you won't be able to run nodetool repair
in such a way because anti-compaction will be triggered which will fail if
multiple anti-compactions are attempted simultaneously (if you run multiple
repairs simultaneously).

Have a look at orchestrating your repairs with TLP's fork of
cassandra-reaper .


Re: Right sizing Cassandra data nodes

2018-02-28 Thread kurt greaves
The problem with higher densities is operations, not querying. When you
need to add nodes/repair/do any streaming operation having more than 3TB
per node becomes more difficult. It's certainly doable, but you'll probably
run into issues. Having said that, an insert only workload is the best
candidate for higher densities.

I'll note that you don't need to bucket by partition really, if you can use
clustering keys (e.g a timestamp) Cassandra will be smart enough to only
read from the SSTables that contain the relevant rows.

But to answer your question, all data is active data. There is no inactive
data. If all you query is the past two months, that's the only data that
will be read by Cassandra. It won't go and read old data unless you tell it
to.
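The months-since-1970 partition key described in the quoted question below can be computed like this (a sketch, assuming UTC timestamps):

```python
from datetime import datetime, timezone

def month_bucket(ts: datetime) -> int:
    """Number of whole months since January 1970, used as the partition key."""
    return (ts.year - 1970) * 12 + (ts.month - 1)

assert month_bucket(datetime(1970, 1, 15, tzinfo=timezone.utc)) == 0
assert month_bucket(datetime(1970, 12, 1, tzinfo=timezone.utc)) == 11
assert month_bucket(datetime(2018, 2, 24, tzinfo=timezone.utc)) == 577
```

Combined with a timestamp clustering key, queries for "the past two months" touch only the latest one or two buckets, which is why the old partitions stay cold.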

On 24 February 2018 at 07:02, onmstester onmstester 
wrote:

> Another Question on node density, in this scenario:
> 1. we should keep time series data of some years for a heavy write system
> in Cassandra (>10K ops per second)
> 2. the system is insert only and inserted data would never be updated
> 3. in partition key, we used number of months since 1970, so data for
> every month would be on separate partitions
> 4. because of rule 2, after the end of month previous partitions would
> never be accessed for write requests
> 5. more than 90% of read requests would concern current month partitions,
> so we rarely access old data; we just keep it for that 10% of
> reports!
> 6. Overall, reads in comparison to writes are very few (like 0.0001% of
> overall time)
>
> So, finally the question:
> Even in this scenario would the active data be the whole data (this month
> + all previous months)? or the one which would be accessed for most reads
> and writes (only the past two months)?
> Could I use more than 3TB per node for this scenario?
>
> Sent using Zoho Mail 
>
>
>  On Tue, 20 Feb 2018 14:58:39 +0330 *Rahul Singh
> >* wrote 
>
> Node density is active data managed in the cluster divided by the number
> of active nodes. E.g. if you have 500TB of active data under management
> then you would need 250-500 nodes to get beast-like optimum performance. It
> also depends on how much memory is on the boxes and if you are using SSD
> drives. SSD doesn’t replace memory but it doesn’t hurt.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 19, 2018, 5:55 PM -0500, Charulata Sharma (charshar) <
> chars...@cisco.com>, wrote:
>
> Thanks for the response Rahul. I did not understand the “node density”
> point.
>
>
>
> Charu
>
>
>
> *From:* Rahul Singh 
> *Reply-To:* "user@cassandra.apache.org" 
> *Date:* Monday, February 19, 2018 at 12:32 PM
> *To:* "user@cassandra.apache.org" 
> *Subject:* Re: Right sizing Cassandra data nodes
>
>
>
> 1. I would keep OpsCenter on a different cluster. Why unnecessarily put
> traffic and computing for opscenter data on a real business data cluster?
> 2. Don’t put more than 1-2 TB per node. Maybe 3TB. Node density as it
> increases creates more replication, read repairs, etc., and memory usage for
> doing the compactions etc.
> 3. Can have as much as you want for snapshots as long as you have it on
> another disk or even move it to a SAN / NAS. All you may care about is the
> most recent snapshot on the physical machine / disks on a live node.
>
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
>
> On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) <
> chars...@cisco.com>, wrote:
>
> Hi All,
>
>
>
> Looking for some insight into how application data archive and purge is
> carried out for a C* database. Are there standard guidelines on calculating
> the amount of space that can be used for storing data in a specific node.
>
>
>
> Some pointers that I got while researching are;
>
>
>
> -  Allocate 50% space for compaction, e.g. if data size is 50GB
> then allocate 25GB for compaction.
>
> -  Snapshot strategy. If old snapshots are present, then they
> occupy the disk space.
>
> -  Allocate some percentage of storage (  ) for system tables
> and OpsCenter tables ?
>
>
>
> We have a scenario where certain transaction data needs to be archived
> based on business rules and some purged, so before deciding on an archival
> strategy, I am trying to analyze
>
> how much transactional data can be stored given the current node capacity.
> I also found out that the space available metric shown in Opscenter is not
> very reliable because it doesn’t show
>
> the snapshot space. In our case, we have a huge snapshot size. For some
> unexplained reason, we seem to be taking snapshots of our data every hour
> and purging them only after 7 days.
>
>
>
>
>
> Thanks,
>
> Charu
>
> Cisco Systems.
>
>
>
>
>
>
>
>
>
>
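The rules of thumb in this thread (keep roughly 50% of disk free for compaction, budget separately for snapshots and system tables) can be turned into a quick back-of-the-envelope check. A minimal sketch; the function name and overhead figures are illustrative assumptions from the thread, not official guidance:

```python
# Back-of-the-envelope node capacity check, using the rules of thumb from
# this thread: keep live data at ~50% of disk so compaction has headroom,
# and budget separately for retained snapshots and system tables.

def usable_data_capacity(disk_gb, compaction_headroom=0.5,
                         snapshot_overhead_gb=0.0, system_overhead_gb=10.0):
    """Max live data (GB) a node can hold under these assumptions."""
    budget = disk_gb - snapshot_overhead_gb - system_overhead_gb
    return max(budget * compaction_headroom, 0.0)

# Example: a 2 TB data disk with ~200 GB of retained snapshots.
print(usable_data_capacity(2000, snapshot_overhead_gb=200))  # 895.0
```

As the thread notes, hourly snapshots retained for 7 days can dominate this budget, so measure snapshot usage directly rather than trusting free-space metrics.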


Re: The home page of Cassandra is mobile friendly but the link to the third parties is not

2018-02-28 Thread kurt greaves
Already addressed in CASSANDRA-14128
, however waiting on
review/comments regarding what we actually do with this page.

If you want to bring attention to JIRA's, user list is probably
appropriate. I'd avoid spamming it too much though.

On 26 February 2018 at 19:22, Kenneth Brotman 
wrote:

> The home page of Cassandra is mobile friendly but the link to the third
> parties from that web page is not.  Any suggestions?
>
>
>
> I made a JIRA for it: https://issues.apache.org/
> jira/browse/CASSANDRA-14263
>
>
>
> Should posts about JIRA’s be on this list or the dev list?
>
>
>
> Kenneth Brotman
>
>
>
>
>


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-21 Thread kurt greaves
>
> Also, I was wondering if the key cache maintains a count of how many local
> accesses a key undergoes. Such information might be very useful for
> compactions of sstables by splitting data by frequency of use so that those
> can be preferentially compacted.

No we don't currently have metrics for that, only overall cache
hits/misses. Measuring individual local accesses would probably have a
performance and memory impact but there's probably a way to do it
efficiently.

Has this been exploited... ever?

Not that I know of. I've theorised about using it previously with some
friends, but never got around to trying it. I imagine if you did you'd
probably have to fix some parts of the code to make it work (like
potentially discoverComponentsFor).
Typically I think any conversation that is relevant to the internals of
Cassandra is fine for the dev list, and that's the desired audience. Not
every dev watches the user list and only developers will really be able to
answer these questions. Let's face it, the dev list is pretty dead so not
sure why we care about a few emails landing there.


Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread kurt greaves
>
> Instead of saying "Make X better" you can quantify "Here's how we can make
> X better" in a jira and the conversation will continue with interested
> parties (opening jiras are free!). Being combative and insulting project on
> mailing list may help vent some frustrations but it is counter productive
> and makes people defensive.

Yep. In the Cassandra project you'll have a very hard time convincing
someone else (under someone else's pay) to work on what you want even if you
approach it in the right way. Being assertive/aggressive is sure to remove
all chances entirely.
OSS for such large projects as Cassandra only works if we have a variety of
perspectives all working on the project together, as it's not very feasible
for volunteers to get into the C* project on their own time (nor will it
ever be). At the moment we don't have enough different perspectives working
on the project and the only way to improve that is get involved (preferably
writing some code).

I have to disagree with people here and point out that just creating JIRA's
and (trying to) have discussions about these issues will not lead to change
in any reasonable timeframe, because everyone who could do the work has an
endless list of bigger fish to fry. I strongly encourage you to get
involved and write some code, or pay someone to do it, because to put it
bluntly, it's *very* unlikely your JIRA's will get actioned unless you
contribute significantly to them yourself.

Of course there are also other ways to contribute as well, but by far the
most effective would be to contribute fixes, the next most effective would
be to contribute documentation and help users on the mailing list. Your
Slender Cassandra project is a great example of this, because despite C*
being hard to administer, it would give a lot of users examples to work
off. If people can get it working properly with the right advice, usability
is not such a big issue.


Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-20 Thread kurt greaves
Probably a lot of work but it would be incredibly useful for vnodes if
flushing was range aware (to be used with RangeAwareCompactionStrategy).
The writers are already range aware for JBOD, but that's not terribly
valuable ATM.

On 20 February 2018 at 21:57, Jeff Jirsa  wrote:

> There are some arguments to be made that the flush should consider
> compaction strategy - would allow a big flush to respect LCS filesizes or
> break into smaller pieces to try to minimize range overlaps going from l0
> into l1, for example.
>
> I have no idea how much work would be involved, but may be worthwhile.
>
>
> --
> Jeff Jirsa
>
>
> On Feb 20,  2018, at 1:26 PM, Jon Haddad  wrote:
>
> The file format is independent from compaction.  A compaction strategy
> only selects sstables to be compacted, that’s it’s only job.  It could have
> side effects, like generating other files, but any decent compaction
> strategy will account for the fact that those other files don’t exist.
>
> I wrote a blog post a few months ago going over some of the nuance of
> compaction you mind find informative: http://thelastpickle.com/blog/2017/
> 03/16/compaction-nuance.html
>
> This is also the wrong mailing list, please direct future user questions
> to the user list.  The dev list is for development of Cassandra itself.
>
> Jon
>
> On Feb 20, 2018, at 1:10 PM, Carl Mueller 
> wrote:
>
> When memtables/CommitLogs are flushed to disk/sstable, does the sstable go
> through sstable organization specific to each compaction strategy, or is
> the sstable creation the same for all compactionstrats and it is up to the
> compaction strategy to recompact the sstable if desired?
>
>
>


Re: vnode random token assignment and replicated data antipatterns

2018-02-20 Thread kurt greaves
>
> Outside of rack awareness, would the next primary ranges take the replica
> ranges?

Yes.


Re: Roadmap for 4.0

2018-02-15 Thread kurt greaves
>
> I don't believe Q3/Q4 is realistic, but I may be biased (or jaded). It's
> possible Q3/Q4 alpha/beta is realistic, but definitely not a release.

Well, this mostly depends on how much stuff to include in 4.0. Either way
it's not terribly important. If people think 2019 is more realistic we can
aim for that. As I said, it's just a rough timeframe to keep in mind.

3.10 was released in January 2017, and we've got around 180 changes for 4.0
so far, and let's be honest, 3.11 is still pretty young so it's going to be
a significant effort to properly test and verify 4.0.
Let's just stick to getting a list of changes for the moment. I probably
shouldn't have mentioned timeframes, let's just keep in mind that we
shouldn't have such a large set of changes for 4.0 that it takes us years
to complete.

All that said, what I really care about is building confidence in the
> release, which means an extended testing cycle. If all of those patches
> landed tomorrow, I'd still expect us to be months away from a release,
> because we need to bake the next major - there's too many changes to throw
> out an alpha/beta/rc and hope someone actually runs it.

Yep. As I said, I'll follow up about testing after we sort out what we're
actually going to include in 4.0. No point trying to come up with a testing
plan for a feature set we haven't agreed on yet.

On 13 February 2018 at 04:25, Jeff Jirsa <jji...@gmail.com> wrote:

>
> Advantages of cutting a release sooner than later:
> 1) The project needs to constantly progress forward. Releases are the most
> visible part of that.
> 2) Having a huge changelog in a release increases the likelihood of bugs
> that take time to find.
>
> Advantages of a slower release:
> 1) We don't do major versions often, and when we do breaking changes
> (protocol, file format, etc), we should squeeze in as many as possible to
> avoid having to roll new majors
> 2) There are probably few people actually running 3.11 at scale, so
> probably few people actually testing trunk.
>
> In terms of "big" changes I'd like to see land, the ones that come to mind
> are:
>
> https://issues.apache.org/jira/browse/CASSANDRA-9754 - "Birch" (changes
> file format)
> https://issues.apache.org/jira/browse/CASSANDRA-13442 - Transient
> Replicas (probably adds new replication strategy or similar)
> https://issues.apache.org/jira/browse/CASSANDRA-13628 - Rest of the
> internode netty stuff (no idea if this changes internode stuff, but I bet
> it's a lot easier if it lands on a major)
> https://issues.apache.org/jira/browse/CASSANDRA-7622 - Virtual Tables
> (selfish inclusion, probably doesn't need to be a major at all, and I
> wouldn't even lose sleep if it slips, but I'd like to see it land)
>
> Stuff I'm ok with slipping to 4.X or 5.0, but probably needs to land on a
> major because we'll change something big (like gossip, or the way schema is
> passed, etc):
>
> https://issues.apache.org/jira/browse/CASSANDRA-9667 - Strongly
> consistent membership
> https://issues.apache.org/jira/browse/CASSANDRA-10699 - Strongly
> consistent schema
>
> All that said, what I really care about is building confidence in the
> release, which means an extended testing cycle. If all of those patches
> landed tomorrow, I'd still expect us to be months away from a release,
> because we need to bake the next major - there's too many changes to throw
> out an alpha/beta/rc and hope someone actually runs it.
>
> I don't believe Q3/Q4 is realistic, but I may be biased (or jaded). It's
> possible Q3/Q4 alpha/beta is realistic, but definitely not a release.
>
>
>
>
> On Sun, Feb 11, 2018 at 8:29 PM, kurt greaves <k...@instaclustr.com>
> wrote:
>
>> Hi friends,
>> *TL;DR: Making a plan for 4.0, ideally everyone interested should provide
>> up to two lists, one for tickets they can contribute resources to getting
>> finished, and one for features they think would be desirable for 4.0, but
>> not necessarily have the resources to commit to helping with.*
>>
>> So we had that Roadmap for 4.0 discussion last year, but there was never
>> a conclusion or a plan that came from it. Times getting on and the changes
>> list for 4.0 is getting pretty big. I'm thinking it would probably make
>> sense to define some goals to getting 4.0 released/have an actual plan. 4.0
>> is already going to be quite an unwieldy release with a lot of testing
>> required.
>>
>> Note: the following is open to discussion, if people don't like the plan
>> feel free to speak up. But in the end it's a pretty basic plan and I don't
>> think we should over-complicate it, I also don't want to end up in a
>> discussion where we "make a plan to make a plan". Regardless of whatever
>> plan we do end up f

Re: Rapid scaleup of cassandra nodes with snapshots and initial_token in the yaml

2018-02-15 Thread kurt greaves
Ben did a talk

that might have some useful information. It's much more complicated with
vnodes though and I doubt you'll be able to get it to be as rapid as you'd
want.

sets up schema to match

This shouldn't be necessary. You'd just join the node as usual but with
auto_bootstrap: false and let the schema be propagated.

Is there an issue if the vnodes tokens for two nodes are identical? Do they
> have to be distinct for each node?

Yeah. This is annoying I know. The new node will take over the tokens of
the old node, which you don't want.


> Basically, I was wondering if we just use this to double the number of
> nodes with identical copies of the node data via snapshots, and then later
> on cassandra can pare down which nodes own which data.

There wouldn't be much point to adding nodes with the same (or almost the
same) tokens. That would just be shifting load. You'd essentially need a
very smart allocation algorithm to come up with good token ranges, but then
you still have the problem of tracking down the relevant SSTables from the
nodes. Basically, bootstrap does this for you ATM and only streams the
relevant sections of SSTables for the new node. If you were doing it from
backups/snapshots you'd need to either do the same thing (eek) or copy all
the SSTables from all the relevant nodes.

With single token nodes this becomes much easier. You can likely get away
with only copying around double/triple the data (depending on how you add
tokens to the ring and RF and node count).

I'll just put it out there that C* is a database and really isn't designed
to be rapidly scalable. If you're going to try, be prepared to invest A LOT
of time into it.
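The "single token nodes make this easier" point can be sketched with a simple ring lookup: with one token per node, the node whose snapshot you'd copy from is just the primary owner of the new node's range. A minimal sketch; the token values and node names are made up, and replication is simplified to primary ranges only:

```python
import bisect

# Sorted single tokens of the existing nodes (hypothetical values).
ring = [(-6000, "node1"), (-1000, "node2"), (4000, "node3")]

def primary_owner(token, ring):
    # The first node clockwise from a token owns it
    # (single-token view, primary range only).
    tokens = [t for t, _ in ring]
    i = bisect.bisect_left(tokens, token) % len(ring)  # wrap around the ring
    return ring[i][1]

# A new node inserted at token 0 would split node3's primary range, so
# node3 is the node whose snapshot holds the data worth copying.
print(primary_owner(0, ring))  # node3
```

With vnodes there are hundreds of such ranges per node, which is why tracking down the relevant SSTables from snapshots becomes so much harder.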


Re: node restart causes application latency

2018-02-12 Thread kurt greaves
Actually, it's not really clear to me why disablebinary and thrift are
necessary prior to drain, because they happen in the same order during
drain anyway. It also really doesn't make sense that disabling gossip after
drain would make a difference here, because it should be already stopped.
This is all assuming drain isn't erroring out.


Re: node restart causes application latency

2018-02-12 Thread kurt greaves
Drain will take care of stopping gossip, and does a few tasks before
stopping gossip (stops batchlog, hints, auth, cache saver and a few other
things). I'm not sure why this causes a side effect when you restart the
node, but there should be no need to issue a disablegossip anyway, just
leave that to the drain. As Jeff said, we need to fix drain because this
chain of commands should be unnecessary.

On 12 February 2018 at 18:36, Mike Torra  wrote:

> Interestingly, it seems that changing the order of steps I take during the
> node restart resolves the problem. Instead of:
>
> `nodetool disablebinary && nodetool disablethrift && *nodetool
> disablegossip* && nodetool drain && sudo service cassandra restart`,
>
> if I do:
>
> `nodetool disablebinary && nodetool disablethrift && nodetool drain && 
> *nodetool
> disablegossip* && sudo service cassandra restart`,
>
> I see no application errors, no latency, and no nodes marked as
> Down/Normal on the restarted node. Note the only thing I changed is that I
> moved `nodetool disablegossip` to after `nodetool drain`. This is pretty
> anecdotal, but is there any explanation for why this might happen? I'll be
> monitoring my cluster closely to see if this change does indeed fix the
> problem.
>
> On Mon, Feb 12, 2018 at 9:33 AM, Mike Torra  wrote:
>
>> Any other ideas? If I simply stop the node, there is no latency problem,
>> but once I start the node the problem appears. This happens consistently
>> for all nodes in the cluster
>>
>> On Wed, Feb 7, 2018 at 11:36 AM, Mike Torra 
>> wrote:
>>
>>> No, I am not
>>>
>>> On Wed, Feb 7, 2018 at 11:35 AM, Jeff Jirsa  wrote:
>>>
 Are you using internode ssl?


 --
 Jeff Jirsa


 On Feb 7, 2018, at 8:24 AM, Mike Torra  wrote:

 Thanks for the feedback guys. That example data model was indeed
 abbreviated - the real queries have the partition key in them. I am using
 RF 3 on the keyspace, so I don't think a node being down would mean the key
 I'm looking for would be unavailable. The load balancing policy of the
 driver seems correct (https://docs.datastax.com/en/
 developer/nodejs-driver/3.4/features/tuning-policies/#load-b
 alancing-policy, and I am using the default `TokenAware` policy with
 `DCAwareRoundRobinPolicy` as a child), but I will look more closely at the
 implementation.

 It was an oversight of mine to not include `nodetool disablebinary`,
 but I still experience the same issue with that.

 One other thing I've noticed is that after restarting a node and seeing
 application latency, I also see that the node I just restarted sees many
 other nodes in the same DC as being down (ie status 'DN'). However,
 checking `nodetool status` on those other nodes shows all nodes as
 up/normal. To me this could kind of explain the problem - node comes back
 online, thinks it is healthy but many others are not, so it gets traffic
 from the client application. But then it gets requests for ranges that
 belong to a node it thinks is down, so it responds with an error. The
 latency issue seems to start roughly when the node goes down, but persists
 long (ie 15-20 mins) after it is back online and accepting connections. It
 seems to go away once the bounced node shows the other nodes in the same DC
 as up again.

 As for speculative retry, my CF is using the default of '99th
 percentile'. I could try something different there, but nodes being seen as
 down seems like an issue.

 On Tue, Feb 6, 2018 at 6:26 PM, Jeff Jirsa  wrote:

> Unless you abbreviated, your data model is questionable (SELECT
> without any equality in the WHERE clause on the partition key will always
> cause a range scan, which is super inefficient). Since you're doing
> LOCAL_ONE and a range scan, timeouts sorta make sense - the owner of at
> least one range would be down for a bit.
>
> If you actually have a partition key in your where clause, then the
> next most likely guess is your clients aren't smart enough to route around
> the node as it restarts, or your key cache is getting cold during the
> bounce. Double check your driver's load balancing policy.
>
> It's also likely the case that speculative retry may help other nodes
> route around the bouncing instance better - if you're not using it, you
> probably should be (though with CL: LOCAL_ONE, it seems like it'd be less
> of an issue).
>
> We need to make bouncing nodes easier (or rather, we need to make
> drain do the right thing), but in this case, your data model looks like 
> the
> biggest culprit (unless it's an incomplete recreation).
>
> - Jeff
>
>
> On Tue, Feb 6, 2018 at 10:58 AM, Mike Torra 

Roadmap for 4.0

2018-02-11 Thread kurt greaves
Hi friends,
*TL;DR: Making a plan for 4.0, ideally everyone interested should provide
up to two lists, one for tickets they can contribute resources to getting
finished, and one for features they think would be desirable for 4.0, but
not necessarily have the resources to commit to helping with.*

So we had that Roadmap for 4.0 discussion last year, but there was never a
conclusion or a plan that came from it. Times getting on and the changes
list for 4.0 is getting pretty big. I'm thinking it would probably make
sense to define some goals to getting 4.0 released/have an actual plan. 4.0
is already going to be quite an unwieldy release with a lot of testing
required.

Note: the following is open to discussion, if people don't like the plan
feel free to speak up. But in the end it's a pretty basic plan and I don't
think we should over-complicate it, I also don't want to end up in a
discussion where we "make a plan to make a plan". Regardless of whatever
plan we do end up following it would still be valuable to have a list of
tickets for 4.0 which is the overall goal of this email - so let's not get
too worked up on the details just yet (save that for after I
summarise/follow up).

// TODO
I think the best way to go about this would be for us to come up with a
list of JIRA's that we want included in 4.0, tag these as 4.0, and all
other improvements as 4.x. We can then aim to release 4.0 once all the 4.0
tagged tickets (+bug fixes/blockers) are complete.

Now, the catch is that we obviously don't want to include too many tickets
in 4.0, but at the same time we want to make sure 4.0 has an appealing
feature set for both users/operators/developers. To minimise scope creep I
think the following strategy will help:

We should maintain two lists:

   1. JIRA's that people want in 4.0 and can commit resources to getting
   them implemented in 4.0.
   2. JIRA's that people simply think would be desirable for 4.0, but
   currently don't have anyone assigned to them or planned assignment. It
   would probably make sense to label these with an additional tag in
JIRA. *(User's
   please feel free to point out what you want here)*

>From list 1 will come our source of truth for when we release 4.0. (after
aggregating a list I will summarise and we can vote on it).

List 2 would be the "hopeful" list, where stories can be picked up from if
resourcing allows, or where someone comes along and decides it's good
enough to work on. I guess we can also base this on a vote system if we
reach the point of including some of them. (but for the moment it's purely
to get an idea of what users actually want).

Please don't refrain from listing something that's already been mentioned.
The purpose is to get an idea of everyone's priorities/interests and the
resources available. We will need multiple resources for each ticket, so
anywhere we share an interest will make for a lot easier work sharing.

Note that we are only talking about improvements here. Bugs will be treated
the same as always, and major issues/regressions we'll need to fix prior to
4.0 anyway.

TIME FRAME
Generally I think it's a bad idea to commit to any hard deadline, but we
should have some time frames in mind. My idea would be to aim for a Q3/4
2018 release, and as we go we just review the outstanding improvements and
decide whether it's worth pushing it back or if we've got enough to
release. I suppose keep this time frame in mind when choosing your tickets.

We can aim for an earlier date (midyear?) but I figure the
testing/validation/bugfixing period prior to release might drag on a bit so
being a bit conservative here.
The main goal would be to not let list 1 grow unless we're well ahead, and
only cull from it if we're heavily over-committed or we decide the
improvement can wait. I assume this all sounds like common sense but
figured it's better to spell it out now.


NEXT STEPS
After 2 weeks/whenever the discussion dies off I'll consolidate all the
tickets, relevant comments and follow up with a summary, where we can
discuss/nitpick issues and come up with a final list to go ahead with.

On a side note, in conjunction with this effort we'll obviously have to do
something about validation and testing. I'll keep that out of this email
for now, but there will be a follow up so that those of us willing to help
validate/test trunk can avoid duplicating effort.

REVIEW
This is the list of "huge/breaking" tickets that got mentioned in the last
roadmap discussion and their statuses. This is not terribly important but
just so we can keep in mind what we previously talked about. I think we
leave it up to the relevant contributors to decide whether they want to get
the still open tickets into 4.0.

CASSANDRA-9425 Immutable node-local schema
 - Committed
CASSANDRA-10699 Strongly consistent schema alterations
 - Open, no
discussion in quite some time.
CASSANDRA-12229 

Re: Heavy one-off writes best practices

2018-02-04 Thread kurt greaves
>
> Would you know if there is evidence that inserting skinny rows in sorted
> order (no batching) helps C*?

This won't have any effect as each insert will be handled separately by the
coordinator (or a different coordinator, even). Sorting is also very
unlikely to help even if you did batch.

 Also, in the case of wide rows, is there evidence that sorting clustering
> keys within partition batches helps ease C*'s job?

No evidence, seems very unlikely.
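One way to see why sorted insertion order doesn't help: the partitioner hashes each partition key, so keys adjacent in sort order land at scattered ring positions and are coordinated and replicated independently. A sketch using MD5 as a stand-in for Cassandra's Murmur3 partitioner (the key names are made up):

```python
import hashlib

def ring_position(partition_key: str) -> int:
    # Stand-in for Cassandra's Murmur3 partitioner: any decent hash
    # scatters lexicographically adjacent keys across the token ring.
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16)

# 100 partition keys written in sorted order...
keys = [f"user{i:04d}" for i in range(100)]
positions = [ring_position(k) for k in keys]

# ...land at scattered ring positions, so sorting the client-side
# insert stream gains nothing.
print("sorted keys kept ring order:", positions == sorted(positions))
```

Clustering key order within a partition is a different matter, but as noted above there's no evidence sorting there helps either, since the memtable keeps rows sorted regardless of arrival order.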


Re: Nodes show different number of tokens than initially

2018-02-01 Thread kurt greaves
So one time I tried to understand why a given token could belong to only a
single node, and it appeared that the restriction came over the fence from
Facebook and has been kept ever since. Personally I don't think it's
necessary, and agree that it is kind of problematic (but there's probably
lots of stuff that relies on this now). Multiple DCs is one example but the
same could apply
to racks. There's no real reason (with NTS) that two nodes in separate
racks can't have the same token. In fact being able to do this would make
token allocation much simpler, and smart allocation algorithms could work
much better with vnodes.
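The single shared ring is easy to demonstrate in a few lines: every node's tokens, from both DCs, sort into one list, and in the ccm example quoted below the range boundaries end up alternating between dc1 and dc2 nodes. A minimal sketch using those six tokens (primary ranges only; no replication strategy or rack awareness):

```python
# One global ring: tokens from both DCs sorted together
# (values taken from the ccm example in the quoted message).
tokens = {
    -9223372036854775808: ("127.0.0.1", "dc1"),
    -9223372036854775708: ("127.0.0.4", "dc2"),
    -3074457345618258603: ("127.0.0.2", "dc1"),
    -3074457345618258503: ("127.0.0.5", "dc2"),
    3074457345618258602:  ("127.0.0.3", "dc1"),
    3074457345618258702:  ("127.0.0.6", "dc2"),
}

ring = sorted(tokens)
ranges = [(ring[i], ring[(i + 1) % len(ring)]) for i in range(len(ring))]

# Each range has one end contributed by a dc1 node and the other by a
# dc2 node, matching the interleaved `nodetool describering` output.
for start, end in ranges:
    print(tokens[start], "->", tokens[end])
```

This is why tokens must be unique across the whole cluster, not just within a DC, even though NTS makes it look like each DC has its own ring.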

On 1 February 2018 at 17:35, Oleksandr Shulgin  wrote:

> On Thu, Feb 1, 2018 at 5:19 AM, Jeff Jirsa  wrote:
>
>>
>>> The reason I find it surprising, is that it makes very little *sense* to
>>> put a token belonging to a mode from one DC between tokens of nodes from
>>> another one.
>>>
>>
>> I don't want to really turn this into an argument over what should and
>> shouldn't make sense, but I do agree, it doesn't make sense to put a token
>> on one node in one DC onto another node in another DC.
>>
>
> This is not what I was trying to say.  I should have used an example to
> express myself clearer.  Here goes (disclaimer: it might sound like a rant,
> take it with a grain of salt):
>
> $ ccm create -v 3.0.15 -n 3:3 -s 2dcs
>
> For a more meaningful multi-DC setup than the default SimpleStrategy, use
> NTS:
>
> $ ccm node1 cqlsh -e "ALTER KEYSPACE system_auth WITH replication =
> {'class': 'NetworkTopologyStrategy', 'dc1': 2, 'dc2': 2};"
>
> $ ccm node1 nodetool ring
>
> Datacenter: dc1
> ==
> AddressRackStatus State   LoadOwns
> Token
>
> 3074457345618258602
> 127.0.0.1  r1  Up Normal  117.9 KB66.67%
> -9223372036854775808
> 127.0.0.2  r1  Up Normal  131.56 KB   66.67%
> -3074457345618258603
> 127.0.0.3  r1  Up Normal  117.88 KB   66.67%
> 3074457345618258602
>
> Datacenter: dc2
> ==
> AddressRackStatus State   LoadOwns
> Token
>
> 3074457345618258702
> 127.0.0.4  r1  Up Normal  121.54 KB   66.67%
> -9223372036854775708
> 127.0.0.5  r1  Up Normal  118.59 KB   66.67%
> -3074457345618258503
> 127.0.0.6  r1  Up Normal  114.12 KB   66.67%
> 3074457345618258702
>
> Note that CCM is aware of the cross-DC clashes and selects the tokens for
> the dc2 shifted by a 100.
>
> Then look at the token ring (output abbreviated and aligned by me):
>
> $ ccm node1 nodetool describering system_auth
>
> Schema Version:4f7d0ad0-350d-3ea0-ae8b-53d5bc34fc7e
> TokenRange:
> TokenRange(start_token:-9223372036854775808,
> end_token:-9223372036854775708, endpoints:[127.0.0.4, 127.0.0.2,
> 127.0.0.5, 127.0.0.3], ... TokenRange(start_token:-9223372036854775708,
> end_token:-3074457345618258603, endpoints:[127.0.0.2, 127.0.0.5,
> 127.0.0.3, 127.0.0.6], ... TokenRange(start_token:-3074457345618258603,
> end_token:-3074457345618258503, endpoints:[127.0.0.5, 127.0.0.3,
> 127.0.0.6, 127.0.0.1], ...
> TokenRange(start_token:-3074457345618258503, end_token:
> 3074457345618258602, endpoints:[127.0.0.3, 127.0.0.6, 127.0.0.1,
> 127.0.0.4], ... TokenRange(start_token: 3074457345618258602, end_token:
> 3074457345618258702, endpoints:[127.0.0.6, 127.0.0.1, 127.0.0.4,
> 127.0.0.2], ...
> TokenRange(start_token: 3074457345618258702, end_token:-9223372036854775808,
> endpoints:[127.0.0.1, 127.0.0.4, 127.0.0.2, 127.0.0.5], ...
>
> So in this setup, every token range has one end contributed by a node from
> dc1 and the other end -- from dc2.  That doesn't model anything in the real
> topology of the cluster.
>
> I see that it's easy to lump together tokens from all nodes and sort them,
> to produce a single token ring (and this is obviously the reason why tokens
> have to be unique throughout the cluster as a whole).  That doesn't mean
> it's a meaningful thing to do.
>
> This introduces complexity which is not present in the problem domain
> initially.  This was a deliberate choice of developers, dare I say, to
> complect the separate DCs together in a single token ring.  This has
> profound consequences from the operations side.  If anything, it prevents
> bootstrapping multiple nodes at the same time even if they are in different
> DCs.  Or would you suggest to set consistent_range_movement=false and
> hope it will work out?
>
> If the whole reason for having separate DCs is to provide isolation, I
> fail to see how the single token ring design does anything towards
> achieving that.
>
> But also being very clear (I want to make sure I understand what you're
>> saying): that's a manual thing you did, Cassandra didn't do it for you,
>> right? The fact that Cassandra didn't STOP you from doing it could be
>> considered a bug, but YOU made that config choice?
>>
>
> Yes, we have chosen exactly the same token for two nodes in different DCs
> 

Re: Not what I‘ve expected Performance

2018-02-01 Thread kurt greaves
That extra code is not necessary; it's just there to retrieve a sampling of
keys. You don't want it if you're copying the whole table. It sounds like
you're taking the right approach; you probably just need some more tuning.
That might be on the Cassandra side as well (concurrent_reads/writes).
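The tuning discussed below (many small executors rather than a few large ones) is essentially bounded parallelism. A language-agnostic sketch of the idea, with a thread pool standing in for Spark executors; the table data and worker count are made-up illustrations, not the actual Spark API:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-ins for the source and destination tables.
source = {f"pk{i}": {"v": i} for i in range(1000)}
target = {}

def copy_partition(item):
    key, row = item
    target[key] = row  # in real life: an INSERT against the new table
    return key

# max_workers plays the role of --total-executor-cores: more workers means
# more concurrent writes hitting the cluster, so raise it gradually and
# watch the cluster rather than maxing it out from the start.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(copy_partition, source.items()))

print(len(target))  # 1000
```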


On 1 Feb. 2018 19:06, "Jürgen Albersdorfer" <jalbersdor...@gmail.com> wrote:

Hi Kurt, thanks for your response.
I indeed utilized Spark - what I've forgot to mention - and I did it nearly
the same as in the example you gave me.
Just without that .select(PK).sample(false, 0.1) Instruction which I don't
actually get what it's useful for - and maybe that's the key to the castle.

I already found out that I require some more Spark Executors - really lots
of them.
And it was a bad Idea in the first place to ./spark-submit without any
parameters about executor-memory, total-executor-cores and especially
executor-cores.
I now submitted it with --executor-cores 1 --total-executor-cores 100 --
executor-memory 8G to get more Executors out of my Cluster.
Without that limits, a Spark Executor will utilize all of the available
cores. With the limitations, the Spark Worker is able to start more
Workers in parallel, which gives a boost in my example,
but it is still way too slow and far from needing any throttling. And
needing to throttle is what I actually expected once 100 processes start
hammering the Database Cluster.

Definitely I'll give your code a try.

2018-02-01 6:36 GMT+01:00 kurt greaves <k...@instaclustr.com>:

> How are you copying? With CQLSH COPY or your own script? If you've got
> spark already it's quite simple to copy between tables and it should be
> pretty much as fast as you can get it. (you may even need to throttle).
> There's some sample code here (albeit it's copying between clusters but
> easily tailored to copy between tables). https://www.instaclus
> tr.com/support/documentation/apache-spark/using-spark-to-
> sample-data-from-one-cassandra-cluster-and-write-to-another/
>
> On 30 January 2018 at 21:05, Jürgen Albersdorfer <jalbersdor...@gmail.com>
> wrote:
>
>> Hi, We are using C* 3.11.1 with a 9 Node Cluster built on CentOS Servers
>> each having 2x Quad Core Xeon, 128GB of RAM and two separate 2TB spinning
>> Disks, one for Log one for Data with Spark on Top.
>>
>> Due to bad Schema (Partitions of about 4 to 8 GB) I need to copy a whole
>> Table into another having same fields but different partitioning.
>>
>> I expected glowing Iron when I started the copy Job, but instead cannot
>> even see some Impact on CPU, mem or disks - but the Job does copy the Data
>> over veeerry slowly at about a MB or two per Minute.
>>
>> Any suggestion where to start investigation?
>>
>> Thanks already
>>
>> Von meinem iPhone gesendet
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>


Re: Nodes show different number of tokens than initially

2018-01-31 Thread kurt greaves
>
> I don’t know why this is a surprise (maybe because people like to talk
> about multiple rings, but the fact that replication strategy is set per
> keyspace and that you could use SimpleStrategy in a multiple dc cluster
> demonstrates this), but we can chat about that another time

This is actually a point of confusion for a lot of new users. It seems
obvious for people who know the internals or who have been around since
pre-NTS/vnodes, but it's really not. Especially because NTS makes it seem
like there are two separate rings.

> that's a manual thing you did, Cassandra didn't do it for you, right? The
> fact that Cassandra didn't STOP you from doing it could be considered a
> bug, but YOU made that config choice?

> This should be fairly easy to reproduce, however Kurt mentioned that there's
> supposed to be some sort of protection against that. I'll try again
> tomorrow.

Sorry, the behaviour was expected. I was under the impression that you
couldn't 'steal' a token from another node (thought C* stopped you), and I
misread the code. It actually gives the token up to the new node - not the
other way round. I haven't thought about it long enough to really consider
what the behaviour should be, or whether the current behaviour is right or
wrong though.
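To make the described behaviour concrete, here is a toy model of token ownership during a join. This is purely illustrative — it is not Cassandra's actual token allocation code, and `announce_token` and the node names are hypothetical:

```python
# Toy model: a ring maps tokens to owning nodes. When a joining node
# announces a token an existing node already owns, the newest announcer
# takes the token over -- the existing node "gives it up" rather than
# the join being rejected.

def announce_token(ring, node, token):
    """Record `node` as owner of `token`; return the displaced owner, if any."""
    displaced = ring.get(token)
    ring[token] = node
    return displaced

ring = {}
announce_token(ring, "node1", -9010676174872681415)
displaced = announce_token(ring, "node2", -9010676174872681415)

print(ring[-9010676174872681415])  # node2 -- the joining node now owns the token
print(displaced)                   # node1 -- the original owner gave it up
```

Whether the real behaviour should instead reject the joining node is exactly the open question above.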


Re: Security Updates

2018-01-31 Thread kurt greaves
Regarding security releases, nothing currently exists to notify users when
security related patches are released. At the moment I imagine
announcements would only be made in NEWS.txt or on the user mailing list...
but only if you're lucky.

On 31 January 2018 at 19:18, Michael Shuler  wrote:

> I should also mention the dev@ mailing list - this is where the [VOTE]
> emails are sent and you'd get an advanced heads up on upcoming releases,
> along with the release emails that are sent to both user@ and dev@. The
> dev@ traffic is generally lower than user@, so pretty easy to spot votes
> & releases.
>
> --
> Michael
>
> On 01/31/2018 01:12 PM, Michael Shuler wrote:
> > I usually install cron-apt for Ubuntu & Debian, forward and read root's
> > email to be notified of all system upgrades, including Cassandra.
> >
> > There are likely other utilities for other operating systems, or just a
> > cron script that checks for system update & emails would work, too.
> >
> > Also, it's possible to use something like urlwatch to look for changes
> > on http://cassandra.apache.org/download/ or any site, email out a
> > notification, etc. Maybe http://fetchrss.com/ or similar would work?
> >
> > I think there are a multitude of immediate ways to do this, until there
> > is a site patch submitted to JIRA for RSS addition.
> >
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
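The urlwatch idea above boils down to "hash the page, compare with last time, notify on change". A minimal, illustrative sketch — `page_changed` and the sample strings are made up, and a real setup would let urlwatch or a cron job do the actual HTTP fetch and the email notification:

```python
# Sketch of change detection for a watched page (e.g. the Cassandra
# download page): hash each snapshot and compare with the stored hash.
import hashlib

def page_changed(page_text, last_hash):
    """Return (changed?, new_hash) for the given page snapshot."""
    new_hash = hashlib.sha256(page_text.encode()).hexdigest()
    return new_hash != last_hash, new_hash

changed, h1 = page_changed("Cassandra 3.11.1 available", None)
print(changed)  # True -- first run, no stored hash yet
changed, h2 = page_changed("Cassandra 3.11.1 available", h1)
print(changed)  # False -- page unchanged, nothing to report
changed, _ = page_changed("Cassandra 3.11.2 available", h2)
print(changed)  # True -- page changed, time to send the notification
```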


Re: Upgrading sstables not using all available compaction slots on version 2.2

2018-01-31 Thread kurt greaves
Would you be able to create a JIRA ticket for this? Not sure if this is
still a problem in 3.0+ but worth creating a ticket to investigate. It'd be
really helpful if you could try and reproduce on 3.0.15 or 3.11.1 to see if
it's an issue there as well.


Re: group by select queries

2018-01-31 Thread kurt greaves
Seems problematic. Would you be able to create a JIRA ticket with the above
information/examples?

On 30 January 2018 at 22:41, Modha, Digant <digant.mo...@tdsecurities.com>
wrote:

> It was local quorum.  There’s no difference with CONSISTENCY ALL.
>
>
>
> Consistency level set to LOCAL_QUORUM.
>
> cassandra@cqlsh> select * from wp.position  where account_id = 'user_1';
>
>
>
> account_id | security_id | counter | avg_exec_price | pending_quantity | quantity | transaction_id | update_time
> ------------+-------------+---------+----------------+------------------+----------+----------------+---------------------------------
>      user_1 |        AMZN |       2 |         1239.2 |                0 |     1011 |           null | 2018-01-25 17:18:07.158000+
>      user_1 |        AMZN |       1 |         1239.2 |                0 |     1010 |           null | 2018-01-25 17:18:07.158000+
>
> (2 rows)
>
> cassandra@cqlsh> select * from wp.position  where account_id = 'user_1'
> group by security_id;
>
>
>
> account_id | security_id | counter | avg_exec_price | pending_quantity | quantity | transaction_id | update_time
> ------------+-------------+---------+----------------+------------------+----------+----------------+---------------------------------
>      user_1 |        AMZN |       1 |         1239.2 |                0 |     1010 |           null | 2018-01-25 17:18:07.158000+
>
> (1 rows)
>
> cassandra@cqlsh> select account_id,security_id, counter,
> avg_exec_price,quantity, update_time from wp.position  where account_id =
> 'user_1' group by security_id ;
>
>
>
> account_id | security_id | counter | avg_exec_price | quantity | update_time
> ------------+-------------+---------+----------------+----------+---------------------------------
>      user_1 |        AMZN |       2 |         1239.2 |     1011 | 2018-01-25 17:18:07.158000+
>
> (1 rows)
>
> cassandra@cqlsh>  consistency all;
>
> Consistency level set to ALL.
>
> cassandra@cqlsh> select * from wp.position  where account_id = 'user_1'
> group by security_id;
>
>
>
> account_id | security_id | counter | avg_exec_price | pending_quantity | quantity | transaction_id | update_time
> ------------+-------------+---------+----------------+------------------+----------+----------------+---------------------------------
>      user_1 |        AMZN |       1 |         1239.2 |                0 |     1010 |           null | 2018-01-25 17:18:07.158000+
>
> (1 rows)
>
> cassandra@cqlsh> select account_id,security_id, counter,
> avg_exec_price,quantity, update_time from wp.position  where account_id =
> 'user_1' group by security_id ;
>
>
>
> account_id | security_id | counter | avg_exec_price | quantity | update_time
> ------------+-------------+---------+----------------+----------+---------------------------------
>      user_1 |        AMZN |       2 |         1239.2 |     1011 | 2018-01-25 17:18:07.158000+
>
>
>
>
>
> *From:* kurt greaves [mailto:k...@instaclustr.com]
> *Sent:* Monday, January 29, 2018 11:03 PM
> *To:* User
> *Subject:* Re: group by select queries
>
>
>
> What consistency were you querying at? Can you retry with CONSISTENCY ALL?
>
>
>
> ​
>
>
> TD Securities disclaims any liability or losses either direct or
> consequential caused by the use of this information. This communication is
> for informational purposes only and is not intended as an offer or
> solicitation for the purchase or sale of any financial instrument or as an
> official confirmation of any transaction. TD Securities is neither making
> any investment recommendation nor providing any professional or advisory
> services relating to the activities described herein. All market prices,
> data and other information are not warranted as to completeness or accuracy
> and are subject to change without notice Any products described herein are
> (i) not insured by the FDIC, (ii) not a deposit or other obligation of, or
> guaranteed by, an insured depository institution and (iii) subject to
> investment risks, including possible loss of the principal amount invested.
> The information shall not be further distributed or duplicated in whole or
> in part by any means without the prior written consent of TD Securities. TD
> Securities is a trademark of The Toronto-Dominion Bank and represents TD
> Securities (USA) LLC and certain investment banking activities of The
> Toronto-Dominion Bank and its subsidiaries.
>


Re: Not what I‘ve expected Performance

2018-01-31 Thread kurt greaves
How are you copying? With CQLSH COPY or your own script? If you've got
spark already it's quite simple to copy between tables and it should be
pretty much as fast as you can get it. (you may even need to throttle).
There's some sample code here (albeit it's copying between clusters but
easily tailored to copy between tables).
https://www.instaclustr.com/support/documentation/apache-spark/using-spark-to-sample-data-from-one-cassandra-cluster-and-write-to-another/
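For illustration only, the shape of such a copy — read every row of the source table, re-key it under the new partitioning, write it to the target — can be sketched with plain Python dicts standing in for the two tables. A real job would do this through Spark and the connector, as in the linked article, so reads and writes are parallelised per token range; `copy_with_new_partitioning` and the sample rows below are hypothetical:

```python
# Illustrative sketch of a table-to-table copy with re-partitioning.
# In-memory structures stand in for the Cassandra tables here.

def copy_with_new_partitioning(source_rows, new_partition_key):
    """Group every source row under the value of `new_partition_key`."""
    target = {}
    for row in source_rows:
        target.setdefault(row[new_partition_key], []).append(row)
    return target

# Source table partitioned by account_id; target partitioned by security_id.
rows = [
    {"account_id": "user_1", "security_id": "AMZN", "quantity": 1010},
    {"account_id": "user_1", "security_id": "GOOG", "quantity": 500},
    {"account_id": "user_2", "security_id": "AMZN", "quantity": 42},
]
target = copy_with_new_partitioning(rows, "security_id")
print(sorted(target))       # ['AMZN', 'GOOG']
print(len(target["AMZN"]))  # 2 -- both AMZN rows now share a partition
```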

On 30 January 2018 at 21:05, Jürgen Albersdorfer 
wrote:

> Hi, We are using C* 3.11.1 with a 9 Node Cluster built on CentOS Servers
> each having 2x Quad Core Xeon, 128GB of RAM and two separate 2TB spinning
> Disks, one for Log one for Data with Spark on Top.
>
> Due to bad Schema (Partitions of about 4 to 8 GB) I need to copy a whole
> Table into another having same fields but different partitioning.
>
> I expected glowing Iron when I started the copy Job, but instead cannot
> even see some Impact on CPU, mem or disks - but the Job does copy the Data
> over veeerry slowly at about a MB or two per Minute.
>
> Any suggestion where to start investigation?
>
> Thanks already
>
> Von meinem iPhone gesendet
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: TWCS not deleting expired sstables

2018-01-31 Thread kurt greaves
Well, that shouldn't happen. Seems like it's possibly not looking in the
correct location for data directories. Try setting CASSANDRA_INCLUDE=<path to
cassandra.in.sh> prior to running the script?
e.g: CASSANDRA_INCLUDE=<install dir>/cassandra.in.sh
sstableexpiredblockers ae raw_logs_by_user

On 30 January 2018 at 15:34, Thakrar, Jayesh <jthak...@conversantmedia.com>
wrote:

> Thanks Kurt and Kenneth.
>
>
>
> Now only if they would work as expected.
>
>
>
> *node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>ls
> -lt | tail *
>
> -rw-r--r--. 1 vchadoop vchadoop286889260 Sep 18 14:14
> mc-1070-big-Index.db
>
> -rw-r--r--. 1 vchadoop vchadoop12236 Sep 13 20:53
> mc-178-big-Statistics.db
>
> -rw-r--r--. 1 vchadoop vchadoop   92 Sep 13 20:53
> mc-178-big-TOC.txt
>
> -rw-r--r--. 1 vchadoop vchadoop  9371211 Sep 13 20:53
> mc-178-big-CompressionInfo.db
>
> -rw-r--r--. 1 vchadoop vchadoop   10 Sep 13 20:53
> mc-178-big-Digest.crc32
>
> -rw-r--r--. 1 vchadoop vchadoop  13609890747 Sep 13
> 20:53 mc-178-big-Data.db
>
> -rw-r--r--. 1 vchadoop vchadoop  1394968 Sep 13 20:53
> mc-178-big-Summary.db
>
> -rw-r--r--. 1 vchadoop vchadoop 11172592 Sep 13 20:53
> mc-178-big-Filter.db
>
> -rw-r--r--. 1 vchadoop vchadoop190508739 Sep 13 20:53
> mc-178-big-Index.db
>
> drwxr-xr-x. 2 vchadoop vchadoop   10 Sep 12 21:47 backups
>
>
>
> *node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>sstableexpiredblockers
> ae raw_logs_by_user*
>
> Exception in thread "main" java.lang.IllegalArgumentException: Unknown
> keyspace/table ae.raw_logs_by_user
>
> at org.apache.cassandra.tools.SSTableExpiredBlockers.main(
> SSTableExpiredBlockers.java:66)
>
>
>
> *node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>sstableexpiredblockers
> system peers*
>
> No sstables for system.peers
>
>
>
> *node111.ord.ae.tsg.cnvr.net:/ae/disk1/data/ae/raw_logs_by_user-f58b9960980311e79ac26928246f09c1>ls
> -l ../../system/peers-37f71aca7dc2383ba70672528af04d4f/*
>
> total 308
>
> drwxr-xr-x. 2 vchadoop vchadoop 10 Sep 11 22:59 backups
>
> -rw-rw-r--. 1 vchadoop vchadoop 83 Jan 25 02:11
> mc-137-big-CompressionInfo.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 180369 Jan 25 02:11 mc-137-big-Data.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 10 Jan 25 02:11 mc-137-big-Digest.crc32
>
> -rw-rw-r--. 1 vchadoop vchadoop 64 Jan 25 02:11 mc-137-big-Filter.db
>
> -rw-rw-r--. 1 vchadoop vchadoop386 Jan 25 02:11 mc-137-big-Index.db
>
> -rw-rw-r--. 1 vchadoop vchadoop   5171 Jan 25 02:11
> mc-137-big-Statistics.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 56 Jan 25 02:11 mc-137-big-Summary.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 92 Jan 25 02:11 mc-137-big-TOC.txt
>
> -rw-rw-r--. 1 vchadoop vchadoop 43 Jan 29 21:11
> mc-138-big-CompressionInfo.db
>
> -rw-rw-r--. 1 vchadoop vchadoop   9723 Jan 29 21:11 mc-138-big-Data.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 10 Jan 29 21:11 mc-138-big-Digest.crc32
>
> -rw-rw-r--. 1 vchadoop vchadoop 16 Jan 29 21:11 mc-138-big-Filter.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 17 Jan 29 21:11 mc-138-big-Index.db
>
> -rw-rw-r--. 1 vchadoop vchadoop   5015 Jan 29 21:11
> mc-138-big-Statistics.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 56 Jan 29 21:11 mc-138-big-Summary.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 92 Jan 29 21:11 mc-138-big-TOC.txt
>
> -rw-rw-r--. 1 vchadoop vchadoop 43 Jan 29 21:53
> mc-139-big-CompressionInfo.db
>
> -rw-rw-r--. 1 vchadoop vchadoop  18908 Jan 29 21:53 mc-139-big-Data.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 10 Jan 29 21:53 mc-139-big-Digest.crc32
>
> -rw-rw-r--. 1 vchadoop vchadoop 16 Jan 29 21:53 mc-139-big-Filter.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 36 Jan 29 21:53 mc-139-big-Index.db
>
> -rw-rw-r--. 1 vchadoop vchadoop   5055 Jan 29 21:53
> mc-139-big-Statistics.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 56 Jan 29 21:53 mc-139-big-Summary.db
>
> -rw-rw-r--. 1 vchadoop vchadoop 92 Jan 29 21:53 mc-139-big-TOC.txt
>
>
>
>
>
>
>
> *From: *Kenneth Brotman <kenbrot...@yahoo.com.INVALID>
> *Date: *Tuesday, January 30, 2018 at 7:37 AM
> *To: *<user@cassandra.apache.org>
> *Subject: *RE: TWCS not deleting expired sstables
>
>
>
> Wow!  It’s in the DataStax documentation: https://docs.datastax.com/en/
> dse/5.1/dse-admin/datastax_enterprise/tools/toolsSStables/
> toolsSStabExpiredBlockers.html
>
>
>
> Other nice tools there as well: h

Re: Cleanup blocking snapshots - Options?

2018-01-31 Thread kurt greaves
Thanks Thomas. I'll give it a shot myself and see if backporting the patch
fixes the problem. If it does I'll create a new ticket for backporting.

On 30 January 2018 at 09:22, Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hi Kurt,
>
>
>
> had another try now, and yes, with 2.1.18, this constantly happens.
> Currently running nodetool cleanup on a single node in production with
> disabled hourly snapshots. SSTables with > 100G involved here. Triggering
> nodetool snapshot will result in being blocked. From an operational
> perspective, a bit annoying right now 
>
>
>
> Have asked on https://issues.apache.org/jira/browse/CASSANDRA-13873
> regarding a backport to 2.1, but possibly won’t get attention, cause the
> ticket has been resolved for 2.2+ already.
>
>
>
> Regards,
>
> Thomas
>
>
>
> *From:* kurt greaves [mailto:k...@instaclustr.com]
> *Sent:* Montag, 15. Jänner 2018 06:18
> *To:* User <user@cassandra.apache.org>
> *Subject:* Re: Cleanup blocking snapshots - Options?
>
>
>
> Disabling the snapshots is the best and only real option other than
> upgrading at the moment. Although apparently it was thought that there was
> only a small race condition in 2.1 that triggered this and it wasn't worth
> fixing. If you are triggering it easily maybe it is worth fixing in 2.1 as
> well. Does this happen consistently? Can you provide some more logs on the
> JIRA or better yet a way to reproduce?
>
>
>
> On 14 January 2018 at 16:12, Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hello,
>
>
>
> we are running 2.1.18 with vnodes in production and due to (
> https://issues.apache.org/jira/browse/CASSANDRA-11155) we can’t run
> cleanup e.g. after extending the cluster without blocking our hourly
> snapshots.
>
>
>
> What options do we have to get rid of partitions a node does not own
> anymore?
>
> · Using a version which has this issue fixed, although upgrading
> to 2.2+, due to various issues, is not an option at the moment
>
> · Temporarily disabling the hourly cron job before starting
> cleanup and re-enable after cleanup has finished
>
> · Any other way to re-write SSTables with data a node owns after
> a cluster scale out
>
>
>
> Thanks,
>
> Thomas
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
>
>

