Re: 3.11.2 memory leak

2018-07-22 Thread kurt greaves
Likely in the next few weeks.

On Mon., 23 Jul. 2018, 01:17 Abdul Patel,  wrote:

> Any idea when 3.11.3 is coming in?
>
> On Tuesday, June 19, 2018, kurt greaves  wrote:
>
>> At this point I'd wait for 3.11.3. If you can't, you can get away with
>> backporting a few repair fixes or just doing sub range repairs on 3.11.2
>>
>> On Wed., 20 Jun. 2018, 01:10 Abdul Patel,  wrote:
>>
>>> Hi All,
>>>
>>> Do we know what's the stable version for now if you wish to upgrade?
>>>
>>> On Tuesday, June 5, 2018, Steinmaurer, Thomas <
>>> thomas.steinmau...@dynatrace.com> wrote:
>>>
>>>> Jeff,
>>>>
>>>>
>>>>
>>>> FWIW, when talking about
>>>> https://issues.apache.org/jira/browse/CASSANDRA-13929, there is a
>>>> patch available since March without getting further attention.
>>>>
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Thomas
>>>>
>>>>
>>>>
>>>> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
>>>> *Sent:* Dienstag, 05. Juni 2018 00:51
>>>> *To:* cassandra 
>>>> *Subject:* Re: 3.11.2 memory leak
>>>>
>>>>
>>>>
>>>> There have been a few people who have reported it, but nobody (yet) has
>>>> offered a patch to fix it. It would be good to have a reliable way to
>>>> repro, and/or an analysis of a heap dump demonstrating the problem (what's
>>>> actually retained at the time you're OOM'ing).
>>>>
>>>>
>>>>
>>>> On Mon, Jun 4, 2018 at 6:52 AM, Abdul Patel 
>>>> wrote:
>>>>
>>>> Hi All,
>>>>
>>>>
>>>>
>>>> I recently upgraded my non prod cluster from 3.10 to 3.11.2.
>>>>
>>>> It was working fine for 1.5 weeks, then suddenly nodetool info started
>>>> reporting 80% and more memory consumption.
>>>>
>>>> Initially it was configured with 16GB, then I bumped it to 20GB and
>>>> rebooted all 4 nodes of the cluster (single DC).
>>>>
>>>> Now after 8 days I again see 80%+ usage, and it's 16GB and above, which
>>>> we never saw before.
>>>>
>>>> Seems like a memory leak bug?
>>>>
>>>> Does anyone have any idea? Our 3.11.2 release rollout has been halted
>>>> because of this.
>>>>
>>>> If not 3.11.2, what's the next best stable release we have now?
>>>>
>>>>
>>>>
>>>


Re: Limitations of Hinted Handoff OverloadedException exception

2018-07-16 Thread kurt greaves
The coordinator will refuse to send writes/hints to a node if it has a
large backlog of hints (128 * #cores) already and the destination replica
is one of the nodes with hints destined to it.
It will still send writes to any "healthy" node (a node with no outstanding
hints).

The idea is to not further overload already overloaded nodes. If you see
OverloadedExceptions you'll have to repair after all nodes become stable.

See StorageProxy.java#L1327, called from
StorageProxy.java::sendToHintedEndpoints()
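The check described above can be sketched in Python. This is illustrative only — the real logic is Java in StorageProxy, and all names here (`maybe_refuse_write`, `MAX_HINTS_IN_PROGRESS`) are invented for the sketch:

```python
import os

# Global backlog limit described above: 128 hints per core.
MAX_HINTS_IN_PROGRESS = 128 * (os.cpu_count() or 1)

class OverloadedException(Exception):
    """Raised when the coordinator refuses a write due to hint backlog."""

def maybe_refuse_write(total_hints_in_progress: int,
                       hints_for_destination: int) -> None:
    # Refuse only if the global backlog is too large AND this destination
    # replica already has hints destined to it (an "unhealthy" node).
    if (total_hints_in_progress > MAX_HINTS_IN_PROGRESS
            and hints_for_destination > 0):
        raise OverloadedException(
            f"Too many in-flight hints: {total_hints_in_progress}")

# A "healthy" destination (no outstanding hints) still receives the write,
# even when the global backlog is over the limit:
maybe_refuse_write(total_hints_in_progress=10**6, hints_for_destination=0)
```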


On 13 July 2018 at 05:38, Karthick V  wrote:

> Refs: https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesHintedHandoff.html
>
> On Thu, Jul 12, 2018 at 7:46 PM Karthick V  wrote:
>
>> Hi everyone,
>>
>>> If several nodes experience brief outages simultaneously, substantial
>>> memory pressure can build up on the coordinator. The coordinator tracks
>>> how many hints it is currently writing, and if the number increases too
>>> much, the coordinator refuses writes and throws the OverloadedException
>>> exception.
>>
>>
>>  In the above statement, it is said that beyond some extent (of hints)
>> the coordinator will refuse writes. Can someone explain the depth of this
>> limitation and its dependencies, if any (like disk size)?
>>
>> Regards
>> Karthick V
>>
>>
>>


Re: Recommended num_tokens setting for small cluster

2018-08-29 Thread kurt greaves
For 10 nodes you probably want to use between 32 and 64. Make sure you use
the token allocation algorithm by specifying allocate_tokens_for_keyspace
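For example, a cassandra.yaml fragment for a new node (the keyspace name is a placeholder; this must be set before the node first bootstraps, and the allocation algorithm requires Murmur3Partitioner):

```yaml
# Fewer, better-placed tokens (32-64 for a ~10 node cluster, per above):
num_tokens: 32
# Tell the allocation algorithm which keyspace's replication factor to
# optimise token placement for:
allocate_tokens_for_keyspace: my_keyspace
```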

On Thu., 30 Aug. 2018, 04:40 Jeff Jirsa,  wrote:

> 3.0 has an (optional?) feature to guarantee better distribution, and the
> blog focuses on 2.2.
>
> Using fewer will minimize your risk of unavailability if any two hosts
> fail.
>
> --
> Jeff Jirsa
>
>
> On Aug 29, 2018, at 11:18 AM, Max C.  wrote:
>
> Hello Everyone,
>
> Datastax recommends num_tokens = 8 as a sensible default, rather than
> num_tokens = 256:
>
>
> https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configVnodes.html
>
> … but then I see stories like this (unbalanced cluster when using
> num_tokens=12), which are very concerning:
>
>
> https://danielparker.me/cassandra/vnodes/tokens/increasing-vnodes-cassandra/
>
> We’re currently running 3.0.x, 3 nodes, RF=3, num_tokens=256, spinning
> disks, soon to be 2 DCs.   My guess is that our cluster will probably not
> grow beyond 10 nodes (10 TB?)
>
> I’d like to minimize the chance of hitting a roadblock down the road due
> to having num_tokens set inappropriately.   We can change this right now
> pretty easily (our dataset is small but growing).  Should we switch from
> 256 to 8?  32?
>
> Has anyone had num_tokens = 8 (or similarly small number) and experienced
> growing pains?  What do you think the recommended setting should be?
>
> Thanks for the advice.  :-)
>
> - Max
>
>


[ANNOUNCE] LDAP Authenticator for Cassandra

2018-07-05 Thread kurt greaves
We've seen a need for an LDAP authentication implementation for Apache
Cassandra so we've gone ahead and created an open source implementation
(ALv2) utilising the pluggable auth support in C*.

Now, I'm positive there are multiple implementations floating around that
haven't been open sourced, and that's understandable given how much of a
nightmare working with LDAP is, so we've come up with an implementation
that will hopefully work for the general case, but should be perfectly
possible to extend, or at least use as an example to create your own and maybe
contribute something back ;). It's by no means perfect, but it seems to
work, and we're hoping people with actual LDAP environments can test and
add support/improvements for more weird LDAP based use cases.

You can find the code and setup + configuration instructions on GitHub,
along with a blog post that goes into more detail.

PS: Don't look too closely at the nasty cache hackery in the 3.11 branch,
I'll fix it in 4.0, I promise. Just be satisfied that it works, I think.


Re: default_time_to_live vs TTL on insert statement

2018-07-11 Thread kurt greaves
The Datastax documentation is wrong. It won't error, and it shouldn't. If
you want to fix that documentation I suggest contacting Datastax.

On 11 July 2018 at 19:56, Nitan Kainth  wrote:

> Hi DuyHai,
>
> Could you please explain in what case C* will error based on documented
> statement:
>
> You can set a default TTL for an entire table by setting the table's
> default_time_to_live property. If you try to set a TTL for a specific
> column that is longer than the time defined by the table TTL, Cassandra
> returns an error.
>
>
>
> On Wed, Jul 11, 2018 at 2:34 PM, DuyHai Doan  wrote:
>
>> The default_time_to_live property applies if you don't specify any TTL on
>> your CQL statement.
>>
>> However, you can always override the default_time_to_live property by
>> specifying a custom value for each CQL statement.
>>
>> The behavior is correct, nothing wrong here
>>
>> On Wed, Jul 11, 2018 at 7:31 PM, Nitan Kainth 
>> wrote:
>>
>>> Hi,
>>>
>>> As per the document: https://docs.datastax.com/en/cql/3.3/cql/cql_using/useExpireExample.html
>>>
>>>
>>> - You can set a default TTL for an entire table by setting the table's
>>>   default_time_to_live property. If you try to set a TTL for a specific
>>>   column that is longer than the time defined by the table TTL, Cassandra
>>>   returns an error.
>>>
>>>
>>> When I tried to test this statement, I found we can insert data with a
>>> TTL greater than default_time_to_live. Does the document need correction,
>>> or am I misunderstanding it?
>>>
>>> CREATE TABLE test (
>>>
>>> name text PRIMARY KEY,
>>>
>>> description text
>>>
>>> ) WITH bloom_filter_fp_chance = 0.01
>>>
>>> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>>>
>>> AND comment = ''
>>>
>>> AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
>>>
>>> AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
>>>
>>> AND crc_check_chance = 1.0
>>>
>>> AND dclocal_read_repair_chance = 0.1
>>>
>>> AND default_time_to_live = 240
>>>
>>> AND gc_grace_seconds = 864000
>>>
>>> AND max_index_interval = 2048
>>>
>>> AND memtable_flush_period_in_ms = 0
>>>
>>> AND min_index_interval = 128
>>>
>>> AND read_repair_chance = 0.0
>>>
>>> AND speculative_retry = '99PERCENTILE';
>>>
>>> insert into test (name, description) values ('name5', 'name
>>> description5') using ttl 360;
>>>
>>> select * from test ;
>>>
>>>
>>>  name  | description
>>>
>>> ---+---
>>>
>>>  name5 | name description5
>>>
>>>
>>> SELECT TTL (description) from test;
>>>
>>>
>>>  ttl(description)
>>>
>>> --
>>>
>>>  351
>>>
>>> Can someone please clear this for me?
>>>
>>>
>>>
>>>
>>>
>>>
>>
>


Re: Cassandra 2.1.18 - Concurrent nodetool repair resulting in > 30K SSTables for a single small (GBytes) CF

2018-03-06 Thread kurt greaves
>
>  What we did have was some sort of overlap between our daily repair
> cronjob and the newly added node still in the process of joining. Don't
> know if this sort of combination might be causing trouble.

I wouldn't be surprised if this caused problems. Probably want to avoid
that.

with waiting a few minutes after each finished execution, and every time I
> see "… out of sync …" log messages in the context of the repair, so it
> looks like each repair execution is detecting inconsistencies. Does this
> make sense?

Well it doesn't, but there have been issues in the past that caused exactly
this problem. I was under the impression they were all fixed by 2.1.18
though.

Additionally, we are writing at CL ANY, reading at ONE and repair chance
> for the 2 CFs in question is default 0.1

Have you considered writing at least at CL [LOCAL_]ONE? At the very least
it would rule out if there's a problem with hints.


Re: One time major deletion/purge vs periodic deletion

2018-03-07 Thread kurt greaves
The important point to consider is whether you are deleting old data or
recently written data. How old/recent depends on your write rate to the
cluster and there's no real formula. Basically you want to avoid deleting a
lot of old data all at once because the tombstones will end up in new
SSTables and the data to be deleted will live in higher levels (LCS) or
large SSTables (STCS), which won't get compacted together for a long time.
In this case it makes no difference if you do a big purge or if you break
it up, because at the end of the day if your big purge is just old data,
all the tombstones will have to stick around for awhile until they make it
to the higher levels/bigger SSTables.

If you have to purge large amounts of old data, the easiest way is to 1.
Make sure you have at least 50% disk free (for large/major compactions)
and/or 2. Use garbagecollect compactions (3.10+)
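A toy model of why purging old data all at once is problematic, as described above. This is illustrative Python only — the function and names are invented, not Cassandra internals:

```python
# A tombstone can be purged only once (a) gc_grace_seconds has elapsed and
# (b) it is compacted together with every SSTable that may still hold the
# shadowed data.
def can_drop_tombstone(now, tombstone_ts, gc_grace_seconds,
                       sstables_with_key, compacting_sstables):
    past_gc_grace = now - tombstone_ts >= gc_grace_seconds
    covers_all_data = set(sstables_with_key) <= set(compacting_sstables)
    return past_gc_grace and covers_all_data

# Old data lives in a large/high-level SSTable that the tombstone's new
# SSTable won't be compacted with for a long time, so the tombstone sticks
# around even though gc_grace has passed:
print(can_drop_tombstone(now=1000, tombstone_ts=0, gc_grace_seconds=100,
                         sstables_with_key=["big-old-1", "new-2"],
                         compacting_sstables=["new-2", "new-3"]))  # False
```

A major compaction (or garbagecollect) forces condition (b) to hold, which is why it drops the tombstones where normal compaction does not.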


Re: Right sizing Cassandra data nodes

2018-02-28 Thread kurt greaves
The problem with higher densities is operations, not querying. When you
need to add nodes/repair/do any streaming operation having more than 3TB
per node becomes more difficult. It's certainly doable, but you'll probably
run into issues. Having said that, an insert only workload is the best
candidate for higher densities.

I'll note that you don't really need to bucket by partition; if you can use
clustering keys (e.g. a timestamp), Cassandra will be smart enough to only
read from the SSTables that contain the relevant rows.
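As a hedged sketch of the schema pattern under discussion (all table and column names invented):

```cql
-- Month bucket in the partition key (months since 1970, as in the
-- question), event time as a clustering key. SSTables record min/max
-- clustering bounds, so a time-bounded query can skip SSTables whose
-- range doesn't overlap the queried window.
CREATE TABLE sensor_data (
    month_bucket int,
    event_time   timestamp,
    sensor_id    text,
    value        double,
    PRIMARY KEY (month_bucket, event_time, sensor_id)
) WITH CLUSTERING ORDER BY (event_time DESC, sensor_id ASC);

-- Typical "current month" read:
SELECT * FROM sensor_data
WHERE month_bucket = 580
  AND event_time >= '2018-04-01' AND event_time < '2018-05-01';
```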

But to answer your question, all data is active data. There is no inactive
data. If all you query is the past two months, that's the only data that
will be read by Cassandra. It won't go and read old data unless you tell it
to.

On 24 February 2018 at 07:02, onmstester onmstester 
wrote:

> Another Question on node density, in this scenario:
> 1. We should keep some years of time series data for a heavy-write system
> in Cassandra (>10K ops per second)
> 2. the system is insert only and inserted data would never be updated
> 3. in partition key, we used number of months since 1970, so data for
> every month would be on separate partitions
> 4. because of rule 2, after the end of month previous partitions would
> never be accessed for write requests
> 5. More than 90% of read requests would concern current-month partitions,
> so we rarely access old data; we just keep it for that 10% of reports!
> 6. The overall read in comparison to writes are so small (like 0.0001 % of
> overall time)
>
> So, finally the question:
> Even in this scenario would the active data be the whole data (this month
> + all previous months)? or the one which would be accessed for most reads
> and writes (only the past two months)?
> Could i use more than 3TB  per node for this scenario?
>
> Sent using Zoho Mail 
>
>
> On Tue, 20 Feb 2018 14:58:39 +0330, Rahul Singh wrote:
>
> Node density is active data managed in the cluster divided by the number
> of active nodes. E.g. if you have 500TB of active data under management,
> then you would need 250-500 nodes to get beast-like optimum performance. It
> also depends on how much memory is on the boxes and whether you are using
> SSD drives. SSDs don't replace memory, but they don't hurt.
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Feb 19, 2018, 5:55 PM -0500, Charulata Sharma (charshar) <
> chars...@cisco.com>, wrote:
>
> Thanks for the response Rahul. I did not understand the “node density”
> point.
>
>
>
> Charu
>
>
>
> *From:* Rahul Singh 
> *Reply-To:* "user@cassandra.apache.org" 
> *Date:* Monday, February 19, 2018 at 12:32 PM
> *To:* "user@cassandra.apache.org" 
> *Subject:* Re: Right sizing Cassandra data nodes
>
>
>
> 1. I would keep OpsCenter on a different cluster. Why unnecessarily put
> traffic and computing for OpsCenter data on a real business data cluster?
> 2. Don't put more than 1-2 TB per node. Maybe 3TB. Node density, as it
> increases, creates more replication, read repairs, etc., and memory usage
> for doing the compactions etc.
> 3. You can have as much as you want for snapshots as long as you have it on
> another disk, or even move it to a SAN / NAS. All you may care about is the
> most recent snapshot on the physical machine / disks on a live node.
>
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
>
> On Feb 19, 2018, 3:08 PM -0500, Charulata Sharma (charshar) <
> chars...@cisco.com>, wrote:
>
> Hi All,
>
>
>
> Looking for some insight into how application data archive and purge is
> carried out for C* database. Are there standard guidelines on calculating
> the amount of space that can be used for storing data in a specific node.
>
>
>
> Some pointers that I got while researching are;
>
>
>
> -  Allocate 50% space for compaction, e.g. if data size is 50GB
> then allocate 25GB for compaction.
>
> -  Snapshot strategy. If old snapshots are present, then they
> occupy the disk space.
>
> -  Allocate some percentage of storage (  ) for system tables
> and OpsCenter tables ?
>
>
>
> We have a scenario where certain transaction data needs to be archived
> based on business rules and some purged, so before deciding on an archive
> and purge strategy, I am trying to analyze
>
> how much transactional data can be stored given the current node capacity.
> I also found out that the space available metric shown in Opscenter is not
> very reliable because it doesn’t show
>
> the snapshot space. In our case, we have a huge snapshot size. For some
> unexplained reason, we seem to be taking snapshots of our data every hour
> and purging them only after 7 days.
>
>
>
>
>
> Thanks,
>
> Charu
>
> Cisco Systems.
>
>
>
>
>
>
>
>
>
>


Re: The home page of Cassandra is mobile friendly but the link to the third parties is not

2018-02-28 Thread kurt greaves
Already addressed in CASSANDRA-14128, however we're waiting on
review/comments regarding what we actually do with this page.

If you want to bring attention to JIRAs, the user list is probably
appropriate. I'd avoid spamming it too much, though.

On 26 February 2018 at 19:22, Kenneth Brotman 
wrote:

> The home page of Cassandra is mobile friendly but the link to the third
> parties from that web page is not.  Any suggestions?
>
>
>
> I made a JIRA for it: https://issues.apache.org/jira/browse/CASSANDRA-14263
>
>
>
> Should posts about JIRAs be on this list or the dev list?
>
>
>
> Kenneth Brotman
>
>
>
>
>


Re: Best way to Drop Tombstones/after GC Grace

2018-03-14 Thread kurt greaves
At least set GCGS == max_hint_window_in_ms, so that you don't effectively
disable hints for the table while your compaction is running. It might be
preferable to use nodetool garbagecollect if you don't have enough disk
space for a major compaction. Also worth noting you should do a splitting
major compaction so you don't end up with one big SSTable when using STCS
(also applicable for LCS)
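For example, assuming the default 3-hour hint window and a placeholder keyspace/table name:

```cql
-- Keep gc_grace_seconds >= the hint window (max_hint_window_in_ms,
-- default 10800000 ms = 3 hours) so hints aren't effectively disabled:
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 10800;
```

For the splitting major compaction, recent versions accept `nodetool compact -s <keyspace> <table>` (split output), which writes several progressively smaller SSTables instead of one giant one.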

On 14 March 2018 at 18:53, Jeff Jirsa  wrote:

> Can’t advise that without knowing the risk to your app if there’s data
> resurrected
>
>
> If there’s no risk, then sure - set gcgs to 0 and force / major compact if
> you have the room
>
>
>
> --
> Jeff Jirsa
>
>
> On Mar 14, 2018, at 11:47 AM, Madhu-Nosql  wrote:
>
> Jeff,
>
> Thank you, I got this. How about dropping the existing tombstones right
> now - would setting gc_grace time to zero at the table level be good, or
> what would you suggest?
>
> On Wed, Mar 14, 2018 at 1:41 PM, Jeff Jirsa  wrote:
>
>> What version of Cassandra?
>>
>> https://issues.apache.org/jira/browse/CASSANDRA-7304 sort of addresses
>> this in 2.2+
>>
>>
>>
>>
>> On Wed, Mar 14, 2018 at 11:32 AM, Madhu-Nosql 
>> wrote:
>>
>>> Rahul,
>>>
>>> The tombstones are caused on the application driver side. Even though
>>> they are not using some of the columns in their logic, the driver logic
>>> is written such that if you update one column, the driver automatically
>>> picks up nulls for the rest of the columns; internally, behind the
>>> scenes, Cassandra treats them as tombstones.
>>>
>>> On Wed, Mar 14, 2018 at 12:58 PM, Rahul Singh <
>>> rahul.xavier.si...@gmail.com> wrote:
>>>
 Then don’t write nulls. That’s the root of the issue. Sometimes they
 surface from prepared statements. Other times they come because of default
 null values in objects.

 --
 Rahul Singh
 rahul.si...@anant.us

 Anant Corporation

 On Mar 13, 2018, 2:18 PM -0400, Madhu-Nosql ,
 wrote:

 We assume that's because of nulls

 On Tue, Mar 13, 2018 at 12:58 PM, Rahul Singh <
 rahul.xavier.si...@gmail.com> wrote:

> Are you writing nulls or does the data cycle that way?
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Mar 13, 2018, 11:48 AM -0400, Madhu-Nosql ,
> wrote:
>
> Rahul,
>
> Nodetool scrub is good for rescue, but what if it's happening all the time?
>
> On Tue, Mar 13, 2018 at 10:37 AM, Rahul Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> Do you anticipate this happening all the time or are you just trying
>> to rescue?
>>
>> Nodetool scrub can be useful too.
>>
>>
>> --
>> Rahul Singh
>> rahul.si...@anant.us
>>
>> Anant Corporation
>>
>> On Mar 13, 2018, 11:29 AM -0400, Madhu-Nosql ,
>> wrote:
>>
>> I've got a few ways to drop tombstones - zombie data - mainly to avoid
>> data resurrection (you delete data and it comes back in the future).
>>
>> I am thinking of below options, let me know if you have any best
>> practice for this
>>
>> 1. Using nodetool garbagecollect
>> 2. only_purge_repaired_tombstones
>> 3. At the table level, setting gc_grace_seconds to zero and compacting
>>
>> Thanks,
>> Madhu
>>
>>
>

>>>
>>
>


Re: Removing initial_token parameter

2018-03-09 Thread kurt greaves
Correct - tokens will be stored in the node's system tables after the first
boot, so feel free to remove the parameter (although it's not really
necessary)
On 9 Mar. 2018 20:16, "Mikhail Tsaplin"  wrote:

> Is it safe to remove the initial_token parameter on a cluster created by
> the snapshot restore procedure presented here:
> https://docs.datastax.com/en/cassandra/latest/cassandra/operations/opsSnapshotRestoreNewCluster.html ?
>
> For me, it seems that the initial_token parameter is used only when nodes
> are started the first time; later, during the next reboot, Cassandra
> obtains tokens from internal structures, and the absence of the
> initial_token parameter would not affect it.
>
>


Re: Cassandra/Spark failing to process large table

2018-03-08 Thread kurt greaves
Note that read repairs only occur for QUORUM/equivalent and higher, and
also with a 10% (default) chance on anything less than QUORUM
(ONE/LOCAL_ONE). This is configured at the table level through the
dclocal_read_repair_chance and read_repair_chance settings (which are going
away in 4.0). So if you read at LOCAL_ONE it would have been chance that
caused the read repair. Don't expect it to happen for every read (unless
you configure it to, or use >=QUORUM).
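The table-level settings mentioned above can be inspected and changed with CQL (placeholder keyspace/table names; these options are removed in 4.0):

```cql
-- 10% chance of triggering a read repair across the local DC for
-- sub-QUORUM reads, and no cross-DC read repair:
ALTER TABLE my_ks.my_table
  WITH dclocal_read_repair_chance = 0.1
   AND read_repair_chance = 0.0;
```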


Re: What versions should the documentation support now?

2018-03-13 Thread kurt greaves
>
> I’ve never heard of anyone shipping docs for multiple versions, I don’t
> know why we’d do that.  You can get the docs for any version you need by
> downloading C*, the docs are included.  I’m a firm -1 on changing that
> process.

We should still host versioned docs on the website however. Either that or
we specify "since version x" for each component in the docs with notes on
behaviour.


Re: Shifting data to DCOS

2018-04-06 Thread kurt greaves
Without looking at the code I'd say maybe the keyspaces are displayed
purely because the directories exist (but it seems unlikely). The process
you should follow instead is to exclude the system keyspaces for each node
and manually apply your schema, then upload your CFs into the correct
directory. Note this only works when RF=#nodes, if you have more nodes you
need to take tokens into account when restoring.

On Fri., 6 Apr. 2018, 17:16 Affan Syed,  wrote:

> Michael,
>
> Both of the folders are with hashes, so I don't think that would be an issue.
>
> What is strange is why the tables don't show up if the keyspaces are
> visible. Shouldn't that be metadata that can be edited once and then be
> visible?
>
> Affan
>
> - Affan
>
> On Thu, Apr 5, 2018 at 7:55 PM, Michael Shuler 
> wrote:
>
>> On 04/05/2018 09:04 AM, Faraz Mateen wrote:
>> >
>> > For example,  if the table is *data_main_bim_dn_10*, its data directory
>> > is named data_main_bim_dn_10-a73202c02bf311e8b5106b13f463f8b9. I created
>> > a new table with the same name through cqlsh. This resulted in creation
>> > of another directory with a different hash i.e.
>> > data_main_bim_dn_10-c146e8d038c611e8b48cb7bc120612c9. I copied all data
>> > from the former to the latter.
>> >
>> > Then I ran *"nodetool refresh ks1  data_main_bim_dn_10"*. After that I
>> > was able to access all data contents through cqlsh.
>> >
>> > Now, the problem is, I have around 500 tables and the method I mentioned
>> > above is quite cumbersome. Bulkloading through sstableloader or remote
>> > seeding are also a couple of options but they will take a lot of time.
>> > Does anyone know an easier way to shift all my data to new setup on
>> DC/OS?
>>
>> For upgrade support from older versions of C* that did not have the hash
>> on the data directory, the table data dir can be just
>> `data_main_bim_dn_10` without the appended hash, as in your example.
>>
>> Give that a quick test to see if that simplifies things for you.
>>
>> --
>> Kind regards,
>> Michael
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>


Re: Many SSTables only on one node

2018-04-09 Thread kurt greaves
If there were no other messages about anti-compaction similar to:
>
> SSTable YYY (ranges) will be anticompacted on range [range]


Then no anti-compaction needed to occur and yes, it was not the cause.

On 5 April 2018 at 13:52, Dmitry Simonov  wrote:

> Hi, Evelyn!
>
> I've found the following messages:
>
> INFO RepairRunnable.java Starting repair command #41, repairing keyspace
> XXX with repair options (parallelism: parallel, primary range: false,
> incremental: false, job threads: 1, ColumnFamilies: [YYY], dataCenters: [],
> hosts: [], # of ranges: 768)
> INFO CompactionExecutor:6 CompactionManager.java Starting anticompaction
> for XXX.YYY on 5132/5846 sstables
>
> After that many similar messages go:
> SSTable BigTableReader(path='/mnt/cassandra/data/XXX/YYY-
> 4c12fd9029e611e8810ac73ddacb37d1/lb-12688-big-Data.db') fully contained
> in range (-9223372036854775808,-9223372036854775808], mutating repairedAt
> instead of anticompacting
>
> Does it means that anti-compaction is not the cause?
>
> 2018-04-05 18:01 GMT+05:00 Evelyn Smith :
>
>> It might not be what cause it here. But check your logs for
>> anti-compactions.
>>
>>
>> On 5 Apr 2018, at 8:35 pm, Dmitry Simonov  wrote:
>>
>> Thank you!
>> I'll check this out.
>>
>> 2018-04-05 15:00 GMT+05:00 Alexander Dejanovski :
>>
>>> 40 pending compactions is pretty high and you should have way less than
>>> that most of the time, otherwise it means that compaction is not keeping up
>>> with your write rate.
>>>
>>> If you indeed have SSDs for data storage, increase your compaction
>>> throughput to 100 or 200 (depending on how the CPUs handle the load). You
>>> can experiment with compaction throughput using : nodetool
>>> setcompactionthroughput 100
>>>
>>> You can raise the number of concurrent compactors as well and set it to
>>> a value between 4 and 6 if you have at least 8 cores and CPUs aren't
>>> overwhelmed.
>>>
>>> I'm not sure why you ended up with only one node having 6k SSTables and
>>> not the others, but you should apply the above changes so that you can
>>> lower the number of pending compactions and see if it prevents the issue
>>> from happening again.
>>>
>>> Cheers,
>>>
>>>
>>> On Thu, Apr 5, 2018 at 11:33 AM Dmitry Simonov 
>>> wrote:
>>>
 Hi, Alexander!

 SizeTieredCompactionStrategy is used for all CFs in problematic
 keyspace.
 Current compaction throughput is 16 MB/s (default value).

 We always have about 40 pending and 2 active "CompactionExecutor" tasks
 in "tpstats".
 Mostly because of another (bigger) keyspace in this cluster.
 But the situation is the same on each node.

 According to "nodetool compactionhistory", compactions on this CF run
 (sometimes several times per day, sometimes one time per day, the last run
 was yesterday).
 We run "repair -full" regulary for this keyspace (every 24 hours on
 each node), because gc_grace_seconds is set to 24 hours.

 Should we consider increasing compaction throughput and
 "concurrent_compactors" (as recommended for SSDs) to keep
 "CompactionExecutor" pending tasks low?

 2018-04-05 14:09 GMT+05:00 Alexander Dejanovski :

> Hi Dmitry,
>
> could you tell us which compaction strategy that table is currently
> using ?
> Also, what is the compaction max throughput and is auto-compaction
> correctly enabled on that node ?
>
> Did you recently run repair ?
>
> Thanks,
>
> On Thu, Apr 5, 2018 at 10:53 AM Dmitry Simonov 
> wrote:
>
>> Hello!
>>
>> Could you please give some ideas on the following problem?
>>
>> We have a cluster with 3 nodes, running Cassandra 2.2.11.
>>
>> We've recently discovered high CPU usage on one cluster node; after
>> some investigation we found that the number of SSTables for one CF on it
>> is very big: 5800 SSTables, versus 3 SSTables on the other nodes.
>>
>> Data size in this keyspace was not very big ~100-200Mb per node.
>>
>> There is no such problem with other CFs of that keyspace.
>>
>> nodetool compact solved the issue as a quick-fix.
>>
>> But I'm wondering, what was the cause? How can I prevent it from repeating?
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>



 --
 Best Regards,
 Dmitry Simonov

>>> --
>>> -
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Dmitry Simonov
>>
>>
>>
>
>
> --
> Best Regards,
> Dmitry 

Re: Token range redistribution

2018-04-18 Thread kurt greaves
A new node always generates more tokens. A replaced node using
replace_address[_on_first_boot] will reclaim the tokens of the node it's
replacing. Simply removing and adding back a new node without replace
address will end up with the new node having different tokens, which would
mean data loss in the use case you described.
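For reference, the replacement is requested via a JVM flag rather than cassandra.yaml — e.g. in cassandra-env.sh (the IP below is a placeholder for the dead node's address):

```sh
# Replace the dead node at 10.0.0.5, reclaiming its tokens. The
# _first_boot variant is only honoured on the replacement node's first
# start, so it is safe to leave in place afterwards.
JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.5"
```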

On Wed., 18 Apr. 2018, 16:51 Akshit Jain,  wrote:

> Hi,
> If I replace a node, does it redistribute the token range, or when the
> node joins again will it be allocated a new token range?
>
> Use case:
> I have booted a C* cluster on AWS. I terminated a node, then booted a new
> node, assigned it the same IP, and made it join the cluster.
>
> In this case, would the token range be redistributed and would the node
> get a new token range?
> Would the process be different for seed nodes?
>
> Regards
> Akshit Jain
>


Re: Token range redistribution

2018-04-19 Thread kurt greaves
That's assuming your data is perfectly consistent, which is unlikely.
Typically that strategy is a bad idea and you should avoid it.

On Thu., 19 Apr. 2018, 07:00 Richard Gray, <richard.g...@smxemail.com>
wrote:

> On 2018-04-18 21:28, kurt greaves wrote:
> > replacing. Simply removing and adding back a new node without replace
> > address will end up with the new node having different tokens, which
> > would mean data loss in the use case you described.
>
> If you have replication factor N > 1, you haven't necessarily lost data
> unless you've swapped out N or more nodes (without using
> replace_address). If you've swapped out fewer than N nodes, you should
> still be able to restore consistency by running a repair.
>
> --
> Richard Gray
>
>
>


Re: SSTable count in Nodetool tablestats(LevelCompactionStrategy)

2018-04-20 Thread kurt greaves
I'm currently investigating this issue on one of our clusters (but much
worse, we're seeing >100 SSTables and only 2 in the levels) on 3.11.1. What
version are you using? It's definitely a bug.

On 17 April 2018 at 10:09,  wrote:

> Dear Community,
>
>
>
> One of the tables in my keyspace is using LevelCompactionStrategy and when
> I used the nodetool tablestats keyspace.table_name command, I found some
> mismatch in the count of SSTables displayed at 2 different places. Please
> refer the attached image.
>
>
>
> The command is giving SSTable count = 6 but if you add the numbers shown
> against SSTables in each level, then that comes out as 5. Why is there a
> difference?
>
>
>
> Thanks and regards,
>
> Vishal Sharma
>
>
>
>
>


Re: Memtable type and size allocation

2018-04-23 Thread kurt greaves
Hi Vishal,

In Cassandra 3.11.2, there are 3 choices for the type of Memtable
> allocation and as per my understanding, if I want to keep Memtables on JVM
> heap I can use heap_buffers and if I want to store Memtables outside of JVM
> heap then I've got 2 options offheap_buffers and offheap_objects.

Heap buffers == everything is allocated on heap, e.g. the entire row and its
contents.
Offheap_buffers is partially on heap, partially offheap: it moves the cell
name + value to offheap buffers. Not sure how much this has changed in 3.x.
Offheap_objects moves entire cells offheap and we only keep a reference to
them on heap.

Also, the permitted memory space to be used for Memtables can be set at 2
> places in the YAML file, i.e. memtable_heap_space_in_mb and
> memtable_offheap_space_in_mb.

 Do I need to configure some space in both heap and offheap, irrespective
> of the Memtable allocation type or do I need to set only one of them based
> on my Memtable allocation type i.e. memtable_heap_space_in_mb when using
> heap buffers and memtable_offheap_space_in_mb only when using either of the
> other 2 offheap options?


Both are still relevant and used if you pick an offheap option. If not, only
memtable_heap_space_in_mb is relevant. For the most part, the defaults (1/4
of the heap size) should be sufficient.
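A small sketch of the budgeting described above. This is my reading of the answer, not Cassandra's actual internal accounting; the 1/4-of-heap default matches the rule of thumb mentioned:

```python
# My reading of the budgeting above, not Cassandra's actual accounting.
def memtable_budgets(heap_mb, allocation_type,
                     heap_space_mb=None, offheap_space_mb=None):
    heap_space = heap_space_mb if heap_space_mb is not None else heap_mb // 4
    offheap_space = offheap_space_mb if offheap_space_mb is not None else heap_mb // 4
    if allocation_type == "heap_buffers":
        return {"heap": heap_space, "offheap": 0}
    if allocation_type in ("offheap_buffers", "offheap_objects"):
        # both budgets apply: part of the memtable still lives on heap
        return {"heap": heap_space, "offheap": offheap_space}
    raise ValueError("unknown memtable_allocation_type: " + allocation_type)

print(memtable_budgets(8192, "heap_buffers"))     # {'heap': 2048, 'offheap': 0}
print(memtable_budgets(8192, "offheap_objects"))  # {'heap': 2048, 'offheap': 2048}
```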


Re: about the tombstone and hinted handoff

2018-04-16 Thread kurt greaves
I don't think that's true; that comment may be misleading. Tombstones AFAIK
will be propagated by hints, and the hint system doesn't do anything to check
whether a particular row has been tombstoned. To the node receiving the hints
it just looks like it's receiving a bunch of writes; it doesn't know they are
hints.
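The point that a hint is indistinguishable from an ordinary write can be sketched with toy last-write-wins reconciliation (heavily simplified; real cells carry more state than a timestamp and a value):

```python
# Toy last-write-wins reconciliation. A hint is just a replayed mutation, so
# a tombstone arriving via hint replay competes on write timestamp like any
# other write; the receiving node never knows it was a hint.
def reconcile(cell_a, cell_b):
    """Each cell is (write_timestamp, value); value None marks a tombstone."""
    return max(cell_a, cell_b, key=lambda cell: cell[0])

live = (100, "v1")
hinted_tombstone = (200, None)  # delete issued while the replica was down
print(reconcile(live, hinted_tombstone))  # (200, None): the delete wins
```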

On 12 April 2018 at 13:51, Jinhua Luo  wrote:

> Hi All,
>
> In the doc:
> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dml
> AboutDeletes.html
>
> It said "When an unresponsive node recovers, Cassandra uses hinted
> handoff to replay the database mutations the node missed while it was
> down. Cassandra does not replay a mutation for a tombstoned record
> during its grace period.".
>
> The tombstone here is on the recovered node or coordinator?
> The tombstone is a special write record, so it must have writetime.
> We could compare the writetime between the version in the hint and the
> version of the tombstone, which is enough to make the choice, so why do we
> need to wait for gc_grace_seconds here?
>
>
>


Re: Shifting data to DCOS

2018-04-16 Thread kurt greaves
Sorry for the delay.

> Is the problem related to token ranges? How can I find out token range for
> each node?
> What can I do to further debug and root cause this?

Very likely. See below.

My previous cluster has 3 nodes but replication factor is 2. I am not
> exactly sure how I would handle the tokens. Can you explain that a bit?

The new cluster will have to have the same token ring as the old one if you
are copying from node to node. Basically you should get the set of tokens for
each node (from nodetool ring) and, when you spin up your 3 new nodes, set
initial_token in the yaml to be the comma-separated list of tokens for
*exactly one* node from the previous cluster. When restoring the SSTables you
need to make sure you take the SSTables from the original node and place them
on the new node that has the *same* list of tokens. If you don't do this the
new node won't be a replica for all the data in those SSTables and
consequently you'll lose data (or it simply won't be available).
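A toy illustration of that bookkeeping; the addresses and token values are made up. Each new node takes the exact token list of exactly one old node (nodetool ring would be the real source), plus that node's SSTables:

```python
# Invented old ring: address -> its token list (from `nodetool ring` in reality).
old_ring = {
    "10.0.0.1": [-9100000000000000000, -3000000000000000000],
    "10.0.0.2": [-8000000000000000000, 1000000000000000000],
    "10.0.0.3": [2000000000000000000, 8000000000000000000],
}
# Each new node is paired with exactly one old node.
new_to_old = {"10.0.1.1": "10.0.0.1", "10.0.1.2": "10.0.0.2", "10.0.1.3": "10.0.0.3"}

# Build the initial_token line each new node needs in its cassandra.yaml.
yaml_lines = {
    new: "initial_token: " + ",".join(str(t) for t in old_ring[old])
    for new, old in new_to_old.items()
}
for new, line in sorted(yaml_lines.items()):
    print(new, "->", line)
```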


Re: Phantom growth resulting automatically node shutdown

2018-04-19 Thread kurt greaves
This was fixed (again) in 3.0.15.
https://issues.apache.org/jira/browse/CASSANDRA-13738

On Fri., 20 Apr. 2018, 00:53 Jeff Jirsa,  wrote:

> There have also been a few sstable ref counting bugs that would over
> report load in nodetool ring/status due to overlapping normal and
> incremental repairs (which you should probably avoid doing anyway)
>
> --
> Jeff Jirsa
>
>
> On Apr 19, 2018, at 9:27 AM, Rahul Singh 
> wrote:
>
> I’ve seen something similar in 2.1. Our issue was related to file
> permissions being flipped due to an automation and C* stopped seeing
> Sstables so it started making new data — via read repair or repair
> processes.
>
> In your case, if nodetool is reporting that much data, that means it's
> growing due to data growth. What does your cfstats / tablestats say? Are you
> monitoring your key tables' data via cfstats metrics like SpaceUsedLive or
> SpaceUsedTotal? What is your snapshotting / backup process doing?
>
> --
> Rahul Singh
> rahul.si...@anant.us
>
> Anant Corporation
>
> On Apr 19, 2018, 7:01 AM -0500, horschi , wrote:
>
> Did you check the number of files in your data folder before & after the
> restart?
>
> I have seen cases where cassandra would keep creating sstables, which
> disappeared on restart.
>
> regards,
> Christian
>
>
> On Thu, Apr 19, 2018 at 12:18 PM, Fernando Neves  > wrote:
>
>> I am facing one issue with our Cassandra cluster.
>>
>> Details: Cassandra 3.0.14, 12 nodes, 7.4TB (JBOD) disk size in each node,
>> ~3.5TB used physical data in each node, ~42TB across the whole cluster, and
>> default compaction setup. This size stays the same because after the
>> retention period some tables are dropped.
>>
>> Issue: Nodetool status is not showing the correct used size in the
>> output. The reported used size keeps increasing without limit until the
>> node automatically shuts down, or until our scheduled sequential restarts
>> (a workaround we run 3 times a week). After a restart, nodetool shows the
>> correct used space, but only for a few days.
>> Did anybody have a similar problem? Is it a bug?
>>
>> Stackoverflow:
>> https://stackoverflow.com/questions/49668692/cassandra-nodetool-status-is-not-showing-correct-used-space
>>
>>
>


Re: Execute an external program

2018-04-03 Thread kurt greaves
Correct. Note that both triggers and CDC aren't widely used yet so be sure
to test.

On 28 March 2018 at 13:02, Earl Lapus  wrote:

>
> On Wed, Mar 28, 2018 at 8:39 AM, Jeff Jirsa  wrote:
>
>> CDC may also work for newer versions, but it’ll happen after the mutation
>> is applied
>>
>> --
>> Jeff Jirsa
>>
>>
> "after the mutation is applied" means after the query is executed?
>
>


Re: auto_bootstrap for seed node

2018-04-03 Thread kurt greaves
Setting auto_bootstrap on seed nodes is unnecessary and irrelevant. If the
node is a seed it will ignore auto_bootstrap and it *will not* bootstrap.

On 28 March 2018 at 15:49, Ali Hubail  wrote:

> "it seems that we still need to keep bootstrap false?"
>
> Could you shed some light on what would happen if the auto_bootstrap is
> removed (or set to true as the default value) in the seed nodes of the
> newly added DC?
>
> What do you have in the seeds param of the new DC nodes (cassandra.yaml)?
> Do you reference the old DC seed nodes there as well?
>
> *Ali Hubail*
>
> Email: ali.hub...@petrolink.com | www.petrolink.com
> Confidentiality warning: This message and any attachments are intended
> only for the persons to whom this message is addressed, are confidential,
> and may be privileged. If you are not the intended recipient, you are
> hereby notified that any review, retransmission, conversion to hard copy,
> copying, modification, circulation or other use of this message and any
> attachments is strictly prohibited. If you receive this message in error,
> please notify the sender immediately by return email, and delete this
> message and any attachments from your system. Petrolink International
> Limited its subsidiaries, holding companies and affiliates disclaims all
> responsibility from and accepts no liability whatsoever for the
> consequences of any unauthorized person acting, or refraining from acting,
> on any information contained in this message. For security purposes, staff
> training, to assist in resolving complaints and to improve our customer
> service, email communications may be monitored and telephone calls may be
> recorded.
>
>
> *"Peng Xiao" <2535...@qq.com <2535...@qq.com>>*
>
> 03/28/2018 12:54 AM
> Please respond to
> user@cassandra.apache.org
>
> To
> "user" ,
>
> cc
> Subject
> Re:  auto_bootstrap for seed node
>
>
>
>
> We followed
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/ops_add_dc_to_cluster_t.html,
> but it does not mention changing auto_bootstrap for seed nodes after the
> rebuild.
>
> Thanks,
> Peng Xiao
>
>
> -- Original --
> *From: * "Ali Hubail";
> *Date: * Wed, Mar 28, 2018 10:48 AM
> *To: * "user";
> *Subject: * Re: auto_bootstrap for seed node
>
> You might want to follow DataStax docs on this one:
>
> For adding a DC to an existing cluster:
> *https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsAddDCToCluster.html*
> 
> For adding a new node to an existing cluster:
> *https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/operations/opsAddNodeToCluster.html*
> 
>
> briefly speaking,
> adding one node to an existing cluster --> use auto_bootstrap
> adding a DC to an existing cluster --> rebuild
>
> You need to check the version of c* that you're running, and make sure you
> pick the right doc version for that.
>
> Most of my colleagues miss very important steps while adding/removing
> nodes/cluster, but if they stick to the docs, they always get it done right.
>
> Hope this helps
>
> * Ali Hubail*
>
>
> *"Peng Xiao" <2535...@qq.com <2535...@qq.com>>*
>
> 03/27/2018 09:39 PM
>
>
> Please respond to
> user@cassandra.apache.org
>
> To
> "user" ,
>
> cc
> Subject
> auto_bootstrap for seed node
>
>
>
>
>
>
> Dear All,
>
> For adding a new DC, we need to set auto_bootstrap: false and then run the
> rebuild; finally we need to change auto_bootstrap: true. But for seed
> nodes, it seems that we still need to keep auto_bootstrap false?
> Could anyone please confirm?
>
> Thanks,
> Peng Xiao
>


Re: replace dead node vs remove node

2018-03-25 Thread kurt greaves
Didn't read the blog, but it's worth noting that if you replace the node and
give it a *different* IP address, repairs will not be necessary, as it will
receive writes during replacement. This works as long as you start up the
replacement node before the HH window ends.

https://issues.apache.org/jira/browse/CASSANDRA-12344 and
https://issues.apache.org/jira/browse/CASSANDRA-11559 fix this for
same-address replacements (hopefully in 4.0).

On Fri., 23 Mar. 2018, 15:11 Anthony Grasso, 
wrote:

> Hi Peng,
>
> Correct, you would want to repair in either case.
>
> Regards,
> Anthony
>
>
> On Fri, 23 Mar 2018 at 14:09, Peng Xiao <2535...@qq.com> wrote:
>
>> Hi Anthony,
>>
>> there is a problem with replacing a dead node as per the blog: if the
>> replacement process takes longer than max_hint_window_in_ms, we must run
>> repair to make the replaced node consistent again, since it missed ongoing
>> writes during bootstrapping. But for a large cluster, repair is a painful
>> process.
>>
>> Thanks,
>> Peng Xiao
>>
>>
>>
>> -- 原始邮件 --
>> *发件人:* "Anthony Grasso";
>> *发送时间:* 2018年3月22日(星期四) 晚上7:13
>> *收件人:* "user";
>> *主题:* Re: replace dead node vs remove node
>>
>> Hi Peng,
>>
>> Depending on the hardware failure you can do one of two things:
>>
>> 1. If the disks are intact and uncorrupted you could just use the disks
>> with the current data on them in the new node. Even if the IP address
>> changes for the new node that is fine. In that case all you need to do is
>> run repair on the new node. The repair will fix any writes the node missed
>> while it was down. This process is similar to the scenario in this blog
>> post:
>> http://thelastpickle.com/blog/2018/02/21/replace-node-without-bootstrapping.html
>>
>> 2. If the disks are inaccessible or corrupted, then use the method as
>> described in the blogpost you linked to. The operation is similar to
>> bootstrapping a new node. There is no need to perform any other remove or
>> join operation on the failed or new nodes. As per the blog post, you
>> definitely want to run repair on the new node as soon as it joins the
>> cluster. In this case here, the data on the failed node is effectively lost
>> and replaced with data from other nodes in the cluster.
>>
>> Hope this helps.
>>
>> Regards,
>> Anthony
>>
>>
>> On Thu, 22 Mar 2018 at 20:52, Peng Xiao <2535...@qq.com> wrote:
>>
>>> Dear All,
>>>
>>> when one node fails with hardware errors, it will be in DN status in
>>> the cluster. Then, if we are not able to handle this error within three
>>> hours (max hints window), we will lose data, right? We have to run repair
>>> to keep consistency.
>>> And as per
>>> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html,
>>> we can replace this dead node. Is it the same as bootstrapping a new node?
>>> That means we don't need to remove the node and rejoin?
>>> Could anyone please advise?
>>>
>>> Thanks,
>>> Peng Xiao
>>>
>>>
>>>
>>>
>>>


Re: Nodetool Repair --full

2018-03-18 Thread kurt greaves
Worth noting that if you have racks == RF you only need to repair one rack
to repair all the data in the cluster if you *don't* use -pr. Also note
that full repairs on >=3.0 cause anti-compactions and will mark things as
repaired, so once you start repairs you need to keep repairing to ensure
you don't have any zombie data or other problems.
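As rough back-of-the-envelope arithmetic for the two options (deliberately ignoring the racks == RF shortcut above and vnode details): `-full` on every node re-repairs each range on all RF replicas, while `-full -pr` on every node covers each primary range exactly once.

```python
# Toy work comparison: number of range-repairs performed when running repair
# on every node of an N-node cluster with replication factor rf.
def range_repairs(num_nodes, rf, pr):
    return num_nodes * (1 if pr else rf)

print(range_repairs(6, 3, pr=False))  # 18: every range repaired RF times
print(range_repairs(6, 3, pr=True))   # 6: the whole ring exactly once
```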

On 17 March 2018 at 15:52, Hannu Kröger  wrote:

> Hi Jonathan,
>
> If you want to repair just one node (for example if it has been down for
> more than 3h), run “nodetool repair -full” on that node. This will bring
> all data on that node up to date.
>
> If you want to repair all data on the cluster, run “nodetool repair -full
> -pr” on each node. This will run full repair on all nodes but it will do it
> so only the primary range for each node is fixed. If you do it on all
> nodes, effectively the whole token range is repaired. You can run the same
> without -pr to get the same effect but it’s not efficient because then you
> are doing the repair RF times on all data instead of just repairing the
> whole data once.
>
> I hope this clarifies,
> Hannu
>
> On 17 Mar 2018, at 17:20, Jonathan Baynes 
> wrote:
>
> Hi Community,
>
> Can someone confirm, as the documentation out on the web is so
> contradictory and vague.
>
> If I call nodetool repair -full, do I need to run it on ALL my nodes, or
> is running it just once sufficient?
>
> Thanks
> J
>
> *Jonathan Baynes*
> DBA
> Tradeweb Europe Limited
> Moor Place  •  1 Fore Street Avenue  •  London EC2Y 9DT
> P +44 (0)20 7776 0988  •  F +44 (0)20 7776 3201  •  M +44 (0)7884 111546
> jonathan.bay...@tradeweb.com
>
> A leading marketplace for electronic fixed income, derivatives and ETF
> trading
>
>
> 
>
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and destroy it. Any unauthorized
> copying, disclosure or distribution of the material in this e-mail is
> strictly forbidden. Tradeweb reserves the right to monitor all e-mail
> communications through its networks. If you do not wish to receive
> marketing emails about our products / services, please let us know by
> contacting us, either by email at contac...@tradeweb.com or by writing to
> us at the registered office of Tradeweb in the UK, which is: Tradeweb
> Europe Limited (company number 3912826), 1 Fore Street Avenue London EC2Y
> 9DT
> .
> To see our privacy policy, visit our website @ www.tradeweb.com.
>
>
>


Re: Cassandra 2.1.18 - Concurrent nodetool repair resulting in > 30K SSTables for a single small (GBytes) CF

2018-03-04 Thread kurt greaves
Repairs with vnodes are likely to cause a lot of small SSTables if you have
inconsistencies (at least 1 per vnode). Did you have any issues when adding
nodes, or did you add multiple nodes at a time? Anything that could have
led to a bit of inconsistency could have been the cause.

I'd probably avoid running the repairs across all the nodes simultaneously
and instead spread them out over a week. That likely made it worse. Also
worth noting that in versions 3.0+ you won't be able to run nodetool repair
in such a way because anti-compaction will be triggered which will fail if
multiple anti-compactions are attempted simultaneously (if you run multiple
repairs simultaneously).

Have a look at orchestrating your repairs with TLP's fork of
cassandra-reaper.


Re: C* in multiple AWS AZ's

2018-06-28 Thread kurt greaves
You still need a repair across both DCs, as rebuild will not stream all
replicas; unless you can guarantee you were perfectly consistent at the time
of the rebuild, you'll want to do a repair after rebuilding.

On another note you could just replace the nodes but use GPFS instead of
EC2 snitch, using the same rack name.

On Fri., 29 Jun. 2018, 00:19 Rahul Singh, 
wrote:

> Parallel load is the best approach and then switch your Data access code
> to only access the new hardware. After you verify that there are no local
> read / writes on the OLD dc and that the updates are only via Gossip, then
> go ahead and change the replication factor on the keyspace to have zero
> replicas in the old DC. Then you can decommission it.
>
> This way you are hundred percent sure that you aren’t missing any new
> data. No need for a DC to DC repair but a repair is always healthy.
>
> Rahul
> On Jun 28, 2018, 9:15 AM -0500, Randy Lynn , wrote:
>
> Already running with Ec2.
>
> My original thought was a new DC parallel to the current, and then
> decommission the other DC.
>
> Also my data load is small right now.. I know small is relative term..
> each node is carrying about 6GB..
>
> So given the data size, would you go with parallel DC or let the new AZ
> carry a heavy load until the others are migrated over?
> and then I think "repair" to cleanup the replications?
>
>
> On Thu, Jun 28, 2018 at 10:09 AM, Rahul Singh <
> rahul.xavier.si...@gmail.com> wrote:
>
>> You don’t have to use EC2 snitch on AWS but if you have already started
>> with it , it may put a node in a different DC.
>>
>> If your data density won’t be ridiculous You could add 3 to different DC/
>> Region and then sync up. After the new DC is operational you can remove one
>> at a time on the old DC and at the same time add to the new one.
>>
>> Rahul
>> On Jun 28, 2018, 9:03 AM -0500, Randy Lynn , wrote:
>>
>> I have a 6-node cluster I'm migrating to the new i3 types.
>> But at the same time I want to migrate to a different AZ.
>>
>> What happens if I do the "running node replace method" with 1 node at a
>> time moving to the new AZ. Meaning, I'll have temporarily;
>>
>> 5 nodes in AZ 1c
>> 1 new node in AZ 1e.
>>
>> I'll wash-rinse-repeat till all 6 are on the new machine type and in the
>> new AZ.
>>
>> Any thoughts about whether this gets weird with the Ec2Snitch and a RF 3?
>>
>> --
>> Randy Lynn
>> rl...@getavail.com
>>
>> office:
>> 859.963.1616 <+1-859-963-1616> ext 202
>> 163 East Main Street - Lexington, KY 40507 - USA
>> 
>>
>>  getavail.com 
>>
>>
>
>
> --
> Randy Lynn
> rl...@getavail.com
>
> office:
> 859.963.1616 <+1-859-963-1616> ext 202
> 163 East Main Street - Lexington, KY 40507 - USA
>
>  getavail.com 
>
>


Re: Re: stream failed when bootstrap

2018-06-27 Thread kurt greaves
Best off trying a rolling restart.

On 28 June 2018 at 03:18, dayu  wrote:

> the output of nodetool describecluster
> Cluster Information:
> Name: online-xxx
> Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> Schema versions:
> c3f00d61-1ad7-3702-8703-af2a29e401c1: [10.136.71.43]
>
> 0568e8c1-48ba-3fb0-bb3c-462438978d7b: [10.136.71.33, ]
>
> after I run nodetool resetlocalschema, a error log outcome
>
> ERROR [InternalResponseStage:209417] 2018-06-28 11:14:12,904
> MigrationTask.java:96 - Configuration
> exception merging remote schema
> org.apache.cassandra.exceptions.ConfigurationException: Column family ID
> mismatch (found 5552bba0-2
> dc6-11e8-9b5c-254242d97235; expected 53f6d520-2dc6-11e8-948d-ab7caa3c8c36)
> at 
> org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:790)
> ~[apac
> he-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:750)
> ~[apache-cassandra-3.0
> .10.jar:3.0.10]
> at org.apache.cassandra.config.Schema.updateTable(Schema.java:661)
> ~[apache-cassandra-3.0.1
> 0.jar:3.0.10]
> at 
> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1348)
> ~[ap
> ache-cassandra-3.0.10.jar:3.0.10]
>
>
>
>
>
> At 2018-06-28 10:01:52, "Jeff Jirsa"  wrote:
>
> You can sometimes bounce your way through it (or use nodetool
> resetlocalschema if it’s a single node that’s wrong), but there are some
> edge cases from which it’s very hard to recover
>
> What’s the output of nodetool describecluster?
>
> If you select from the schema tables, do you see that CFID on any real
> tables?
>
> --
> Jeff Jirsa
>
>
> On Jun 27, 2018, at 7:58 PM, dayu  wrote:
>
> That sound reasonable, I have seen schema mismatch error before.
> So any advise to deal with schema mismatches?
>
> Dayu
>
> At 2018-06-28 09:50:37, "Jeff Jirsa"  wrote:
> >That log message says you did:
> >
> > CF 53f6d520-2dc6-11e8-948d-ab7caa3c8c36 was dropped during streaming
> >
> >If you’re absolutely sure you didn’t, you should look for schema mismatches 
> >in your cluster
> >
> >
> >--
> >Jeff Jirsa
> >
> >
> >> On Jun 27, 2018, at 7:49 PM, dayu  wrote:
> >>
> >> CF 53f6d520-2dc6-11e8-948d-ab7caa3c8c36 was dropped during streaming
> >
>
>
>
>
>
>
>
>
>


Re: Re: Re: stream failed when bootstrap

2018-06-28 Thread kurt greaves
Yeah, but you only really need to drain and then restart Cassandra, one node
at a time. Not that the other steps will hurt, but they aren't strictly
necessary.
On 28 June 2018 at 05:38, dayu  wrote:

> Hi kurt, a rolling restart means running the disablebinary, disablethrift,
> disablegossip, drain, stop-cassandra and start-cassandra commands one by
> one, right?
> Only one node is restarted at a time.
>
> Dayu
>
>
>
> At 2018-06-28 11:37:43, "kurt greaves"  wrote:
>
> Best off trying a rolling restart.


Re: C* in multiple AWS AZ's

2018-06-29 Thread kurt greaves
Yes. You would just end up with a rack named differently to the AZ. This is
not a problem as racks are just logical. I would recommend migrating all
your DCs to GPFS though for consistency.

On Fri., 29 Jun. 2018, 09:04 Randy Lynn,  wrote:

> So we have two data centers already running..
>
> AP-SYDNEY, and US-EAST.. I'm using Ec2Snitch over a site-to-site tunnel..
> I'm wanting to move the current US-EAST from AZ 1a to 1e..
> I know all docs say use ec2multiregion for multi-DC.
>
> I like the GPFS idea. would that work with the multi-DC too?
> What's the downside? status would report rack of 1a, even though in 1e?
>
> Thanks in advance for the help/thoughts!!
>


[ANNOUNCE] StratIO's Lucene plugin fork

2018-10-18 Thread kurt greaves
Hi all,

We've had confirmation from Stratio that they are no longer maintaining
their Lucene plugin for Apache Cassandra. We've thus decided to fork the
plugin to continue maintaining it. At this stage we won't be making any
additions to the plugin in the short term unless absolutely necessary, and
as 4.0 nears we'll begin making it compatible with the new major release.
We plan on taking the existing PR's and issues from the Stratio repository
and getting them merged/resolved, however this likely won't happen until
early next year. Having said that, we welcome all contributions and will
dedicate time to reviewing bugs in the current versions if people lodge
them and can help.

I'll note that this is new ground for us, we don't have much existing
knowledge of the plugin but are determined to learn. If anyone out there
has established knowledge about the plugin we'd be grateful for any
assistance!

You can find our fork here:
https://github.com/instaclustr/cassandra-lucene-index
At the moment, the only difference is that there is a 3.11.3 branch which
just has some minor changes to dependencies to better support 3.11.3.

Cheers,
Kurt


Re: Tombstone removal optimization and question

2018-11-06 Thread kurt greaves
Yes it does. Consider if it didn't and you kept writing to the same
partition, you'd never be able to remove any tombstones for that partition.

On Tue., 6 Nov. 2018, 19:40 DuyHai Doan wrote:
>
> Hello all
>
> I have tried to sum up all rules related to tombstone removal:
>
>
> --
>
> Given a tombstone written at timestamp (t) for partition key (P) in
> SSTable (S1), this tombstone will be removed:
>
> 1) after gc_grace_seconds period has passed
> 2) at the next compaction round, if SSTable S1 is selected (not at all
> guaranteed because compaction is not deterministic)
> 3) if the partition key (P) is not present in any other SSTable that is
> NOT picked by the current round of compaction
>
> Rule 3) is quite complex to understand, so here is the detailed explanation:
>
> If partition key (P) also exists in another SSTable (S2) that is NOT
> compacted together with SSTable (S1), removing the tombstone could let
> some data in S2 resurrect.
>
> Precisely, at compaction time, Cassandra does not have ANY detail about
> partition (P) that stays in S2, so it cannot remove the tombstone right away.
>
> Now, for each SSTable, we have some metadata, namely minTimestamp and
> maxTimestamp.
>
> I wonder if the current compaction optimization does use/leverage this
> metadata for tombstone removal. Indeed if we know that tombstone timestamp
> (t) < minTimestamp, it can be safely removed.
>
> Does someone have the info?
>
> Regards
>
>
>
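Kurt's answer above ("yes it does") matches the optimization DuyHai describes: the minTimestamp metadata is what lets compaction prove a tombstone is safe to purge. A minimal Python sketch of the purge rules from the thread, using hypothetical dict-based SSTable stand-ins rather than Cassandra's actual internals:

```python
import time

def tombstone_droppable(tombstone_ts, deletion_time, gc_grace_seconds,
                        partition_key, other_sstables, now=None):
    """Sketch of the purge rules discussed in this thread.

    tombstone_ts    -- write timestamp (t) of the tombstone
    deletion_time   -- local deletion time of the tombstone, in epoch seconds
    other_sstables  -- SSTables NOT part of this compaction round; each is a
                       dict with 'keys' (partition keys it contains) and
                       'min_timestamp' metadata
    """
    now = time.time() if now is None else now
    # Rule 1: gc_grace_seconds must have elapsed since the deletion.
    if now < deletion_time + gc_grace_seconds:
        return False
    # Rule 3: no SSTable outside this compaction may hold data for (P) that
    # the tombstone could still be shadowing.  The minTimestamp metadata is
    # exactly what proves that: if t < minTimestamp, everything in that
    # SSTable is newer than the tombstone, so nothing can resurrect.
    for sstable in other_sstables:
        if (partition_key in sstable['keys']
                and sstable['min_timestamp'] <= tombstone_ts):
            return False
    return True
```

Rule 2 (the SSTable must actually be selected for compaction) is implicit here: this check only runs for tombstones inside the SSTables being compacted.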


Re: SSTableMetadata Util

2018-10-01 Thread kurt greaves
Pranay,

3.11.3 should include all the C* binaries in /usr/bin. Maybe try
reinstalling? Sounds like something got messed up along the way.

Kurt
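If reinstalling doesn't surface the binary, the tarball's tools/bin copy works too, and its output is plain text that is easy to post-process. A hedged Python sketch; the "Name: value" line format (e.g. "Repaired at: 0") is an assumption based on the 3.11 tool's output and may differ between versions:

```python
import re
import subprocess

def sstable_summary(sstable_path, tool="sstablemetadata"):
    """Invoke sstablemetadata (point 'tool' at tools/bin/sstablemetadata if
    it is missing from /usr/bin) and extract the fields asked about above."""
    out = subprocess.run([tool, sstable_path], capture_output=True,
                         text=True, check=True).stdout
    return parse_metadata(out)

def parse_metadata(text):
    """Pull 'Repaired at' and 'Estimated droppable tombstones' out of the
    tool's line-oriented output (format assumed; adjust to your version)."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"(Repaired at|Estimated droppable tombstones):\s*(.+)",
                     line)
        if m:
            fields[m.group(1)] = m.group(2).strip()
    return fields
```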

On Tue, 2 Oct 2018 at 12:45, Pranay akula 
wrote:

> Thanks Christophe,
>
> I have installed using the rpm package. I ran the locate command to find
> the sstable utils and could find only those four.
>
> Probably I may need to manually copy them.
>
> Regards
> Pranay
>
> On Mon, Oct 1, 2018, 9:01 PM Christophe Schmitz <
> christo...@instaclustr.com> wrote:
>
>> Hi Pranay,
>>
>> The sstablemetadata tool is still available in the tarball
>> ($CASSANDRA_HOME/tools/bin) in 3.11.3. I'm not sure why it is not available
>> in your packaged installation; you might want to manually copy the one from
>> the tarball to your /usr/bin/.
>>
>> Additionally, you can have a look at
>> https://github.com/instaclustr/cassandra-sstable-tools, which will
>> provide you with the desired info, plus more info you might find useful.
>>
>>
>> Christophe Schmitz - Instaclustr -
>> Cassandra | Kafka | Spark Consulting
>>
>>
>>
>>
>>
>> On Tue, 2 Oct 2018 at 11:31 Pranay akula 
>> wrote:
>>
>>> Hi,
>>>
>>> I am testing Apache Cassandra 3.11.3 and couldn't find the sstablemetadata util.
>>>
>>> All i can see is only these utilities in /usr/bin
>>>
>>> -rwxr-xr-x.   1 root root2042 Jul 25 06:12 sstableverify
>>> -rwxr-xr-x.   1 root root2045 Jul 25 06:12 sstableutil
>>> -rwxr-xr-x.   1 root root2042 Jul 25 06:12 sstableupgrade
>>> -rwxr-xr-x.   1 root root2042 Jul 25 06:12 sstablescrub
>>> -rwxr-xr-x.   1 root root2034 Jul 25 06:12 sstableloader
>>>
>>>
>>> If this utility is no longer available, how can I get sstable metadata
>>> like repaired_at and estimated droppable tombstones?
>>>
>>>
>>> Thanks
>>> Pranay
>>>
>>


Re: stuck with num_tokens 256

2018-09-22 Thread kurt greaves
If you have problems with balance you can add new nodes using the allocation
algorithm and it'll balance out the cluster. You probably want to stick to
256 tokens though.
To reduce your number of tokens you'll have to do a DC migration (best way).
Spin up a new DC using the algorithm on the nodes and set a lower number of
tokens. You'll want to test first, but if you create a keyspace for the new
DC prior to creating the new nodes, with the desired RF (i.e. a keyspace
just in the "new" DC with your RF), and then add your nodes using that
keyspace for allocation, tokens *should* be distributed evenly amongst that
DC; once you migrate you can decommission the old DC and hopefully end up
with a balanced cluster.
Definitely test beforehand though, because that was just me theorising...

I'll note though that if your existing clusters don't have any major issues,
it's probably not worth the migration at this point.
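One way to sanity-check the result of such a migration (or the current state of a cluster) is to compute each node's share of the token ring from its tokens, e.g. as collected from nodetool ring. The sketch below is replica-unaware, so it only approximates the ownership that nodetool status reports:

```python
def ownership(tokens_by_node, ring_size=2**64):
    """Fraction of the token ring each node's tokens own (primary ranges
    only, ignoring replication).  Murmur3 tokens span -2**63 .. 2**63 - 1,
    hence the default ring size of 2**64."""
    ring = sorted((t, node) for node, toks in tokens_by_node.items()
                  for t in toks)
    owned = {node: 0 for node in tokens_by_node}
    # Each token owns the range from the previous token on the ring
    # (exclusive, wrapping around) up to itself (inclusive).
    for (tok, node), (prev_tok, _) in zip(ring, [ring[-1]] + ring[:-1]):
        owned[node] += (tok - prev_tok) % ring_size
    return {node: owned[node] / ring_size for node in owned}
```

With random allocation and few tokens the spread between the most- and least-loaded nodes can be large; the allocation algorithm instead targets an even spread for the replication factor it is given.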

On Sat, 22 Sep 2018 at 17:40, onmstester onmstester 
wrote:

> I noticed that currently there is a discussion in ML with
> subject: changing default token behavior for 4.0.
> Any recommendation for those of us who already have multiple clusters (>
> 30 nodes in each cluster) with the random partitioner and num_tokens = 256?
> I should also add some nodes to existing clusters; is that possible
> with num_tokens = 256?
> How could we fix this bug (reduce num_tokens in existing clusters)?
> Cassandra version: 3.11.2
>
> Sent using Zoho Mail 
>
>
>


Re: stuck with num_tokens 256

2018-09-22 Thread kurt greaves
>
> But one more question, should i use num_tokens : 8 (i would follow
> datastax recommendation) and allocate_tokens_for_local_replication_factor=3
> (which is max RF among my keyspaces) for new clusters which i'm going to
> setup?

16 is probably where it's at. Test beforehand though.
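For a new cluster, the settings under discussion live in cassandra.yaml and must be in place before a node first bootstraps. A hedged sketch; note that `allocate_tokens_for_local_replication_factor` only exists in 4.0+, while 3.11.x uses `allocate_tokens_for_keyspace`:

```yaml
# num_tokens cannot be changed after a node has bootstrapped.
num_tokens: 16

# 3.11.x: point the allocation algorithm at a keyspace whose replication
# settings match your target RF in this DC (create it before adding nodes).
allocate_tokens_for_keyspace: my_app_keyspace

# 4.0+ alternative, which does not require a pre-existing keyspace:
# allocate_tokens_for_local_replication_factor: 3
```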

> Is the allocation algorithm now the recommended algorithm, and is it mature
> enough to replace the random algorithm? If so, should it be the default in
> 4.0?

Let's leave that discussion to the other thread on the dev list.

On Sat, 22 Sep 2018 at 20:35, onmstester onmstester 
wrote:

> Thanks,
> Because all my clusters are already balanced, i won't change their config
> But one more question, should i use num_tokens : 8 (i would follow
> datastax recommendation) and allocate_tokens_for_local_replication_factor=3
> (which is max RF among my keyspaces) for new clusters which i'm going to
> setup?
> Is the allocation algorithm now the recommended algorithm, and is it mature
> enough to replace the random algorithm? If so, should it be the default in
> 4.0?
>
>
>  On Sat, 22 Sep 2018 13:41:47 +0330 *kurt greaves
> >* wrote 
>
> If you have problems with balance you can add new nodes using the
> algorithm and it'll balance out the cluster. You probably want to stick to
> 256 tokens though.
> To reduce your # tokens you'll have to do a DC migration (best way). Spin
> up a new DC using the algorithm on the nodes and set a lower number of
> tokens. You'll want to test first but if you create a new keyspace for the
> new DC prior to creation of the new nodes with the desired RF (ie. a
> keyspace just in the "new" DC with your RF) then add your nodes using that
> keyspace for allocation tokens *should* be distributed evenly amongst
> that DC, and when migrate you can decommission the old DC and hopefully end
> up with a balanced cluster.
> Definitely test beforehand though because that was just me theorising...
>
> I'll note though that if your existing clusters don't have any major
> issues it's probably not worth the migration at this point.
>
> On Sat, 22 Sep 2018 at 17:40, onmstester onmstester 
> wrote:
>
>
> I noticed that currently there is a discussion in ML with
> subject: changing default token behavior for 4.0.
> Any recommendation to guys like me who already have multiple clusters ( >
> 30 nodes in each cluster) with random partitioner and num_tokens = 256?
> I should also add some nodes to existing clusters, is it possible
> with num_tokens = 256?
> How could we fix this bug (reduce num_tokens in existent clusters)?
> Cassandra version: 3.11.2
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>
>


Re: stuck with num_tokens 256

2018-09-22 Thread kurt greaves
No, that's not true.

On Sat., 22 Sep. 2018, 21:58 onmstester onmstester, 
wrote:

>
> If you have problems with balance you can add new nodes using the
> algorithm and it'll balance out the cluster. You probably want to stick to
> 256 tokens though.
>
>
> I read somewhere (I don't remember the ref) that all nodes of the cluster
> should use the same algorithm, so if my cluster suffers from imbalanced
> nodes using the random algorithm, I cannot add new nodes that use the
> allocation algorithm. Isn't that correct?
>
>
>


Re: TWCS + subrange repair = excessive re-compaction?

2018-09-26 Thread kurt greaves
Not any faster: you'll still have to wait for all the SSTables to age off,
as a partition-level tombstone will simply go to a new SSTable and likely
will not be compacted with the old SSTables.
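Whether a partition spans time windows is easy to estimate once you know its earliest and latest write times (e.g. from the min/max timestamps reported by sstablemetadata across the SSTables holding it). A small sketch, assuming the timestamps have already been converted to seconds:

```python
def windows_spanned(min_write_ts, max_write_ts, window_seconds):
    """Number of TWCS time windows covered by a partition's writes.

    A partition confined to one window can expire together with that
    window's SSTable; one spanning hundreds of windows keeps old SSTables
    from being dropped and makes subrange repair stream old data.
    """
    if max_write_ts < min_write_ts:
        raise ValueError("max_write_ts must be >= min_write_ts")
    first_window = min_write_ts // window_seconds
    last_window = max_write_ts // window_seconds
    return int(last_window - first_window) + 1
```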

On Tue, 25 Sep 2018 at 17:03, Martin Mačura  wrote:

> Most partitions in our dataset span one or two SSTables at most.  But
> there might be a few that span hundreds of SSTables.  If I located and
> deleted them (partition-level tombstone), would this fix the issue?
>
> Thanks,
>
> Martin
> On Mon, Sep 24, 2018 at 1:08 PM Jeff Jirsa  wrote:
> >
> >
> >
> >
> > On Sep 24, 2018, at 3:47 AM, Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
> >
> > On Mon, Sep 24, 2018 at 10:50 AM Jeff Jirsa  wrote:
> >>
> >> Do your partitions span time windows?
> >
> >
> > Yes.
> >
> >
> > The data structure used to know if data needs to be streamed (the merkle
> tree) is only granular to - at best - a token, so even with subrange repair
> if a byte is off, it’ll stream the whole partition, including parts of old
> repaired sstables
> >
> > Incremental repair is smart enough not to diff or stream already
> repaired data, but the matrix of which versions allow subrange AND
> incremental repair isn’t something I’ve memorized (I know it behaves the
> way you’d hope in trunk/4.0 after Cassandra-9143)
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: node replacement failed

2018-09-22 Thread kurt greaves
I don't like your cunning plan. Don't drop the system auth and distributed
keyspaces; instead, just change them to NTS and then do your replacement for
each down node.

 If you're actually using auth and are worried about consistency, I believe
3.11 has the feature to exclude nodes during a repair, which you could use
to repair just the auth keyspace.
But if you're not using auth, go ahead and change them; then doing all your
replaces is the best method of recovery here.
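The keyspace change suggested above boils down to a handful of ALTER KEYSPACE statements. A sketch that generates them; "dc1" and the RF of 3 are placeholders, the DC names must match what nodetool status reports, and each keyspace should be repaired after the change:

```python
def alter_to_nts(keyspace, rf_by_dc):
    """Build the CQL to move a keyspace from SimpleStrategy to
    NetworkTopologyStrategy with an explicit RF per datacenter."""
    replication = ", ".join("'%s': %d" % (dc, rf)
                            for dc, rf in sorted(rf_by_dc.items()))
    return ("ALTER KEYSPACE %s WITH replication = "
            "{'class': 'NetworkTopologyStrategy', %s};"
            % (keyspace, replication))

# The system keyspaces discussed in this thread, with placeholder DC/RF.
for ks in ("system_auth", "system_distributed", "system_traces"):
    print(alter_to_nts(ks, {"dc1": 3}))
```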

On Sun., 23 Sep. 2018, 00:33 onmstester onmstester, 
wrote:

> Another question,
> Is there a management tool to do nodetool cleanup one by one (wait until
> cleanup finishes on one node, then start cleanup on the next node in the
> cluster)?
>  On Sat, 22 Sep 2018 16:02:17 +0330 *onmstester onmstester
> >* wrote 
>
> I have a cunning plan (Baldrick wise) to solve this problem:
>
>- stop client application
>- run nodetool flush on all nodes to save memtables to disk
>- stop cassandra on all of the nodes
>- rename original Cassandra data directory to data-old
>- start cassandra on all the nodes to create a fresh cluster including
>the old dead nodes
>- again create the application related keyspaces in cqlsh and this
>time set rf=2 on system keyspaces (to never encounter this problem again!)
>- move sstables from the data-old dir to the current data dirs and restart
>cassandra or reload sstables
>
>
> Should this work and solve my problem?
>
>
>  On Mon, 10 Sep 2018 17:12:48 +0430 *onmstester onmstester
> >* wrote 
>
>
>
> Thanks Alain,
> First here it is more detail about my cluster:
>
>- 10 racks + 3 nodes on each rack
>- nodetool status: shows 27 nodes UN and 3 nodes all related to single
>rack as DN
>- version 3.11.2
>
> *Option 1: (Change schema and) use replace method (preferred method)*
> * Did you try to have the replace going, without any former repairs,
> ignoring the fact 'system_traces' might be inconsistent? You probably don't
> care about this table, so if Cassandra allows it with some of the nodes
> down, going this way is relatively safe probably. I really do not see what
> you could lose that matters in this table.
> * Another option, if the schema first change was accepted, is to make the
> second one, to drop this table. You can always rebuild it in case you need
> it I assume.
>
> I really love to let the replace going, but it stops with the error:
>
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
>
> Also I could delete system_traces, which is empty anyway, but there are
> system_auth and system_distributed keyspaces too and they are not empty.
> Could I delete them safely too?
> If i could just somehow skip streaming the system keyspaces from node
> replace phase, the option 1 would be great.
>
> P.S: It's clear to me that I should use at least RF=3 in production, but I
> could not manage to acquire enough resources yet (I hope this will be fixed
> in the near future)
>
> Again Thank you for your time
>
> Sent using Zoho Mail 
>
>
>  On Mon, 10 Sep 2018 16:20:10 +0430 *Alain RODRIGUEZ
> >* wrote 
>
>
>
> Hello,
>
> I am sorry it took us (the community) more than a day to answer to this
> rather critical situation. That being said, my recommendation at this point
> would be for you to make sure about the impacts of whatever you would try.
> Working on a broken cluster as an emergency might lead you to a second
> mistake, possibly more destructive than the first one. It has happened to me
> and to others, on many clusters. As a general piece of advice, move forward
> even more carefully in these situations.
>
> Suddenly I lost all disks of cassandra-data on one of my racks
>
>
> With RF=2, I guess operations use LOCAL_ONE consistency, thus you should
> have all the data in the safe rack(s) with your configuration, you probably
> did not lose anything yet and have the service only using the nodes up,
> that got the right data.
>
>  tried to replace the nodes with same ip using this:
>
> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>
>
> As a side note, I would recommend you to use 'replace_address_first_boot'
> instead of 'replace_address'. This does basically the same but will be
> ignored after the first bootstrap. A detail, but hey, it's there and
> somewhat safer, I would use this one.
>
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
>
> By default, non-user keyspace use 'SimpleStrategy' and a small RF.
> Ideally, this should be changed in a production cluster, and you're having
> an example of why.
>
> Now I altered the system_traces keyspace strategy to
> NetworkTopologyStrategy and RF=2,
> but then running nodetool repair failed: Endpoint not alive /IP of the dead
> node that I'm trying to replace.
>
>
> Changing the replication strategy you made the dead 
