Re: stress testing & lab provisioning tools

2024-02-28 Thread Alexander DEJANOVSKI
Hey Jon,

It's awesome to see that you're reviving both of these projects!

I was eager to get my hands on an updated version of tlp-cluster with
up-to-date AMIs.

tlp-stress is by far the best Cassandra stress tool I've worked with, and I
recommend that everyone test easy-cass-stress and build additional workload
types.

Looking forward to testing these new forks.

Alex

On Tue, Feb 27, 2024, 02:00, Jon Haddad wrote:

> Hey everyone,
>
> Over the last several months I've put a lot of work into 2 projects I
> started back at The Last Pickle, for stress testing Cassandra and for
> building labs in AWS.  You may know them as tlp-stress and tlp-cluster.
>
> Since I haven't worked at TLP in almost half a decade, and am the primary
> / sole person investing time, I've rebranded them to easy-cass-stress and
> easy-cass-lab.  There have been several major improvements in both projects
> and I invite you to take a look at both of them.
>
> easy-cass-stress
>
> Many of you are familiar with tlp-stress.  easy-cass-stress is a fork /
> rebrand of the project that uses almost the same familiar interface as
> tlp-stress, but with some improvements.  easy-cass-stress is even easier to
> use, requiring less guesswork with the parameters to figure out your
> performance profile.  Instead of providing a -c flag (for in-flight
> concurrency) you can now simply provide your max read and write latencies
> and it'll figure out the throughput it can sustain on its own, or use fixed
> rate scheduling like many other benchmarking tools have.  The adaptive
> scheduling is based on a Netflix Tech Blog post, but slightly modified to
> be sensitive to latency metrics instead of just errors.   You can read more
> about some of my changes here:
> https://rustyrazorblade.com/post/2023/2023-10-31-tlp-stress-adaptive-scheduler/
>
> GH repo: https://github.com/rustyrazorblade/easy-cass-stress
>
> easy-cass-lab
>
> This is a powerful tool that makes it much easier to spin up lab
> environments using any released version of Cassandra, with functionality
> coming to test custom branches and trunk.  It's a departure from the old
> tlp-cluster that installed and configured everything at runtime.  By
> creating a universal, multi-version AMI complete with all my favorite
> debugging tools, it's now possible to create a lab environment in under 2
> minutes in AWS.  The image includes easy-cass-stress making it
> straightforward to spin up clusters to test existing releases, and soon
> custom builds and trunk.  Fellow committer Jordan West has been working on
> this with me and we've made a ton of progress over the last several weeks.
>  For a demo check out my working session live stream last week where I
> fixed a few issues and discussed the potential and development path for the
> tool: https://youtu.be/dPtsBut7_MM
>
> GH repo: https://github.com/rustyrazorblade/easy-cass-lab
>
> I hope you find these tools as useful as I have.  I am aware of many
> extremely large Cassandra teams using tlp-stress with their 1K+ node
> environments, and hope the additional functionality in easy-cass-stress
> makes it easier for folks to start benchmarking C*, possibly in conjunction
> with easy-cass-lab.
>
> Looking forward to hearing your feedback,
> Jon
>


Re: Switching to Incremental Repair

2024-02-04 Thread Alexander DEJANOVSKI
Hi Sebastian,

That's a feature we need to implement in Reaper. I think disallowing the
start of a new incremental repair would be easier to manage than pausing
the full repair that's already running. It's also what I'd expect as a
user.

I'll create an issue to track this.

On Sat, Feb 3, 2024, 16:19, Sebastian Marsching wrote:

> Hi,
>
> 2. use an orchestration tool, such as Cassandra Reaper, to take care of
> that for you. You will still need to monitor and alert to ensure the repairs
> are run successfully, but fixing a stuck or failed repair is not very time
> sensitive; you can usually leave it till Monday morning if it happens on
> Friday night.
>
> Does anyone know how such a schedule can be created in Cassandra Reaper?
>
> I recently learned the hard way that running both a full and an
> incremental repair for the same keyspace and table in parallel is not a
> good idea (it caused a very unpleasant overload situation on one of our
> clusters).
>
> At the moment, we have one schedule for the full repairs (every 90 days)
> and another schedule for the incremental repairs (daily). But as full
> repairs take much longer than a day (about a week, in our case), the two
> schedules collide. So, Cassandra Reaper starts an incremental repair while
> the full repair is still in progress.
>
> Does anyone know how to avoid this? Optimally, the full repair would be
> paused (no new segments started) for the duration of the incremental
> repair. The second best option would be inhibiting the incremental repair
> while a full repair is in progress.
>
> Best regards,
> Sebastian
>
>


Re: state of incremental repairs in cassandra 3.x

2021-09-17 Thread Alexander DEJANOVSKI
Hi James,

I'd recommend upgrading to 4.0.1 if you intend to use incremental repair.
The changes from CASSANDRA-9143
 are massive and
couldn't be backported to the 3.11 branch.

When moving to incremental repair, and in order to limit anticompaction on
the first run, I'd recommend that you (a rough sketch of these steps
follows the list):
- mark all sstables as repaired
- run a full repair
- schedule very regular (daily) incremental repairs
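
A rough sketch of those three steps (untested; keyspace/table names and
data paths are placeholders for a package install, and the SSTable marking
has to be done one node at a time while the node is down):

# 1. Mark all SSTables of the table as repaired (node stopped).
sudo systemctl stop cassandra
sstablerepairedset --really-set --is-repaired \
    /var/lib/cassandra/data/my_ks/my_table-*/*-big-Data.db
sudo systemctl start cassandra

# 2. Run one full repair to fix any inconsistencies hidden by step 1.
nodetool repair --full my_ks my_table

# 3. From then on, run very regular (daily) incremental repairs.
nodetool repair my_ks my_table   # incremental is the default mode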

Bye,

Alex


On Thu, Sep 16, 2021 at 23:03, C. Scott Andreas wrote:

> Hi James, thanks for reaching out.
>
> A large number of fixes have landed for Incremental Repair in the 3.x
> series, though it's possible some may have been committed to 4.0 without a
> backport. Incremental repair works well on Cassandra 4.0.1. I'd start here
> to ensure you're picking up all fixes that went in, though I do think it's
> likely to work well on a recent 3.0.x build as well (I'm less familiar with
> the 3.11.x series).
>
> – Scott
>
> On Sep 16, 2021, at 1:02 PM, James Brown  wrote:
>
>
> There's been a lot of back and forth on the wider Internet and in this
> mailing list about whether incremental repairs are fatally flawed in
> Cassandra 3.x or whether they're still a good default. What's the current
> best thinking? The most recent 3.x documentation
>  still
> advocates in favor of using incremental repairs...
>
> CASSANDRA-9143  is
> marked as fixed in 4.0; did any improvements make it into any of the 3.11.x
> releases?
>
> If I need the performance of incremental repairs, should I just be
> plotting a 4.0.x upgrade?
>
> --
> James Brown
> Engineer
>
>
>


Re: Backup cassandra and restore. Best practices

2021-04-06 Thread Alexander DEJANOVSKI
Yes, Minio is supported by Medusa through the S3-compatible backend.
I reckon we need to update the docs with a guide on setting up those
backends, but configuring your medusa.ini is pretty much the same as for
Ceph S3 RGW (see the sketch below):
- use s3_compatible as the storage backend
- set the host, port and region settings appropriately to connect to your
Minio install
- set the "secure" setting to false (libcloud doesn't support SSL on
S3-compatible backends)

And you should be good to go.
We also have integration tests that use the Minio backend.
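
To make that concrete, here's a minimal sketch of the [storage] section for
a local Minio install. The endpoint, region and bucket are made up, and the
exact key names should be double checked against the Medusa docs:

sudo tee -a /etc/medusa/medusa.ini <<'EOF'
[storage]
storage_provider = s3_compatible
host = minio.internal.example.com
port = 9000
region = us-east-1
secure = False
bucket_name = cassandra-backups
EOF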

Cheers,

Alex

On Tue, Apr 6, 2021 at 12:33, Erick Ramirez wrote:

> Minio is a supported type --
> https://github.com/apache/libcloud/blob/trunk/libcloud/storage/types.py#L108
>
> On Tue, 6 Apr 2021 at 20:29, Erick Ramirez 
> wrote:
>
>> This is a useful tool, but we look for smth that could store backups in
>>> local S3 (like minio), not Amazon or else..
>>>
>>
>> As I stated in my response, Medusa supports any S3-like storage that the
>> Apache Libcloud API can access. See the docs I linked. Cheers!
>>
>


Re: Anti Compactions while running repair

2020-11-08 Thread Alexander DEJANOVSKI
Only SSTables in the unrepaired state go through anticompaction.

On Mon, Nov 9, 2020 at 07:01, manish khandelwal wrote:

> Thanks Alex.
>
> One more query: are all sstables (repaired + unrepaired) part of
> anti-compaction? We are using full repair with the -pr option.
>
> Regards
> Manish
>
> On Mon, Nov 9, 2020 at 11:17 AM Alexander DEJANOVSKI <
> adejanov...@gmail.com> wrote:
>
>> Hi Manish,
>>
>> Anticompaction is the same whether you run full or incremental repair.
>>
>>
>> On Fri, Nov 6, 2020 at 04:37, manish khandelwal <
>> manishkhandelwa...@gmail.com> wrote:
>>
>>> In documentation it is given that while running incremental repairs,
>>> anti compaction is done which results in repaired and unrepaired sstables.
>>> Since anti compaction also runs with full repair and primary range repairs,
>>> I have the following
>>>  question:
>>>
>>> Is anti compaction different in case of full repairs and incremental
>>> repairs?
>>>
>>>
>>> Regards
>>> Manish
>>>
>>


Re: Issue with anti-compaction while running full repair with -pr option

2020-11-08 Thread Alexander DEJANOVSKI
Hi,

You have two options to disable anticompaction when running full repair:

- add the list of DCs using the --dc flag (even if there's just a single DC
in your cluster)
- Use subrange repair, which is done by tools such as Reaper (it can be
challenging to do it yourself on a vnode cluster).

You'll have to mark the sstables which are currently marked as repaired
back to the unrepaired state. This operation requires stopping one node at
a time and using the sstablerepairedset tool (check the official Cassandra
docs for more info).
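
As a rough sketch of both the DC-restricted repair and the unrepaired
marking (keyspace, table, DC name and paths are placeholders; check
"nodetool help repair" and the sstablerepairedset docs for your exact
version):

# A full repair restricted to a DC skips anticompaction, even if DC1 is the
# only datacenter in the cluster.
nodetool repair -full -dc DC1 my_keyspace

# Moving already-anticompacted SSTables back to unrepaired, one node at a time:
sudo systemctl stop cassandra
sstablerepairedset --really-set --is-unrepaired \
    /var/lib/cassandra/data/my_keyspace/my_table-*/*-big-Data.db
sudo systemctl start cassandra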

FTR, Cassandra 4.0 will not perform anticompaction anymore on full repairs.

Cheers,

Alex

On Mon, Nov 9, 2020 at 05:57, Pushpendra Rajpoot <
pushpendra.nh.rajp...@gmail.com> wrote:

> Hi Team,
>
> In Cassandra 3.x, Anti-compaction is performed after repair (incremental
> or full). Repair does not have any way to bypass anti-compaction (if not
> running sub range repair with -st & -et). Here is a jira ticket.
>
> https://issues.apache.org/jira/browse/CASSANDRA-11511
>
> I am facing 100% disk utilization when running repair -pr after upgrading
> Cassandra from 2.1.16 to 3.11.2. I have a 2TB disk & 1.35TB is used.
>
> There are multiple keyspaces and each keyspace has multiple tables.
> One of the sstables holds a huge amount of data. Here are the details: the
> keyspace, table & SSTable have 1.3TB, 730GB, and 270GB of data respectively.
>
> I have following questions :
>
>1. Is there any way to disable anti-compaction after repair is
>completed ?
>2. After what stage, anti-compaction is performed after repair ?
>3. Any other suggestions?
>
> Regards,
> Pushpendra
>


Re: Anti Compactions while running repair

2020-11-08 Thread Alexander DEJANOVSKI
Hi Manish,

Anticompaction is the same whether you run full or incremental repair.


On Fri, Nov 6, 2020 at 04:37, manish khandelwal wrote:

> In documentation it is given that while running incremental repairs, anti
> compaction is done which results in repaired and unrepaired sstables. Since
> anti compaction also runs with full repair and primary range repairs, I
> have the following
>  question:
>
> Is anti compaction different in case of full repairs and incremental
> repairs?
>
>
> Regards
> Manish
>


Re: Tool for schema upgrades

2020-10-08 Thread Alexander DEJANOVSKI
I second Alex's recommendation.
We use https://github.com/patka/cassandra-migration to manage schema
migrations in Reaper and it has a consensus feature to prevent concurrent
migrations from clashing.

Cheers,

Alex

On Thu, Oct 8, 2020 at 19:10, Alex Ott wrote:

> Hi
>
> Look at https://github.com/patka/cassandra-migration - it should be good.
>
> P.S. Here is the list of tools that I assembled over the years:
>
>- [ ] https://github.com/hhandoko/cassandra-migration
>- [ ] https://github.com/Contrast-Security-OSS/cassandra-migration
>- [ ] https://github.com/juxt/joplin
>- [ ] https://github.com/o19s/trireme
>- [ ] https://github.com/golang-migrate/migrate
>- [ ] https://github.com/Cobliteam/cassandra-migrate
>- [ ] https://github.com/patka/cassandra-migration
>- [ ] https://github.com/comeara/pillar
>
> On Thu, Oct 8, 2020 at 5:45 PM Paul Chandler  wrote:
>
>> Hi all,
>>
>> Can anyone recommend a tool to perform schema DDL upgrades, that follows
>> best practice to ensure you don’t get schema mismatches if running multiple
>> upgrade statements in one migration ?
>>
>> Thanks
>>
>> Paul
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>
> --
> With best wishes,
> Alex Ott
> http://alexott.net/
> Twitter: alexott_en (English), alexott (Russian)
>


Re: How to predict time to complete for nodetool repair

2020-03-23 Thread Alexander DEJANOVSKI
Also, Reaper will skip the anticompaction phase, which you might be going
through with nodetool (depending on your version of Cassandra).
That'll reduce the overall time spent on repair and remove some
compaction pressure.

But as Erick said, unless you have past repairs to rely on and a stable
data size, it is impossible to predict the time it takes for repair to
complete.

Cheers,

Alex

On Mon, Mar 23, 2020 at 12:44, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Mar 23, 2020 at 5:49 AM Shishir Kumar 
> wrote:
>
>> Hi,
>>
>> Is it possible to get/predict how much time it will take for *nodetool
>> -pr *to complete on a node? Currently in one of my env (~800GB data per
>> node in 6 node cluster), it is running since last 3 days.
>>
>
> Cassandra Reaper used to provide a reasonably accurate estimate as I
> recall.  Of course, the repair has to be triggered by Reaper itself--it's
> no use if you have already started it with nodetool.
>
> Regards,
> --
> Alex
>
>


Re: How to elect a normal node to a seed node

2020-02-12 Thread Alexander Dejanovski
Seed nodes are special in the sense that other nodes need them for
bootstrap (first startup only) and they have a special place in the Gossip
system. The odds of gossiping to a seed node are higher than for other
nodes, which makes them "hubs" of gossip messaging.
Also, they do not bootstrap, so they won't stream data in on their first
start.

Aside from that, any node can become a seed node at any time. Just update
the seed list on all nodes, roll-restart the cluster, and you'll have a new
set of seed nodes.

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Feb 12, 2020 at 6:48 PM Sergio  wrote:

> So if
> 1) I stop a Cassandra node that doesn't have itself in its seeds IP list
> 2) I change the cassandra.yaml of this node and I add it to the seed list
> 3) I restart the node
>
> It will work completely fine and this is not even necessary.
>
> This means that from the client driver perspective, when I define the
> contact points I can specify any node in the cluster as a contact point
> and not necessarily a seed node?
>
> Best,
>
> Sergio
>
>
> On Wed, Feb 12, 2020, 9:08 AM Arvinder Dhillon 
> wrote:
>
>> I believe seed nodes are not special nodes; it's just that you choose a
>> few nodes from the cluster that help to bootstrap new joining nodes. You
>> can change cassandra.yaml to make any other node a seed node. There's no
>> such thing as promotion.
>>
>> -Arvinder
>>
>> On Wed, Feb 12, 2020, 8:37 AM Sergio  wrote:
>>
>>> Hi guys!
>>>
>>> Is there a way to promote a not seed node to a seed node?
>>>
>>> If yes, how do you do it?
>>>
>>> Thanks!
>>>
>>


Re: What is "will be anticompacted on range" ?

2020-02-10 Thread Alexander Dejanovski
Hi,

Full repair triggers anticompaction as well.
Only subrange repair doesn't trigger anticompaction, and in 4.0, AFAIK,
full repairs won't involve anticompaction anymore.

Cheers,

On Mon, Feb 10, 2020 at 19:17, Krish Donald wrote:

> Thanks Jeff. But we are running repair using the below command, how do we
> know if incremental repair is enabled?
>
> repair -full -pr
>
> Thanks
> KD
>
> On Mon, Feb 10, 2020 at 10:09 AM Jeff Jirsa  wrote:
>
>> Incremental repair is splitting the data it repaired from the data it
>> didn't repair so it can mark the repaired data with a repairedAt timestamp
>> annotation on the data file / sstable.
>>
>>
>> On Mon, Feb 10, 2020 at 9:39 AM Krish Donald 
>> wrote:
>>
>>> Hi,
>>>
>>> I noticed few messages in system.log like below:
>>> INFO  [CompactionExecutor:21] 2020-02-08 17:56:16,998
>>> CompactionManager.java:677 - [repair #fb044b01-4ab5-11ea-a736-a367dba4ed71]
>>> SSTable BigTableReader(path='xyz/mc-79976-big-Data.db')
>>> ((-8828745000913291684,8954981413747359495]) will be anticompacted on range
>>> (1298637302462891853,1299655718091763872]
>>>
>>> And compactionstats was showing below .
>>> id   compaction type
>>> keyspace table   completedtotalunit  progress
>>> 82ee9720-3c86-11ea-adda-b11edeb80235 Anticompaction after repair
>>> customer profile 182882813624 196589990177 bytes 93.03%
>>>
>>> We are on 3.11.
>>>
>>> What is the meaning of this compaction type, "Anticompaction after
>>> repair"?
>>> Haven't noticed this in the 2.x versions.
>>>
>>> Thanks
>>> KD
>>>
>>>


Cassandra Reaper 2.0 was released

2019-12-19 Thread Alexander Dejanovski
Hi folks,

I wanted to share with you that Reaper 2.0 was recently released.
It ships with the new sidecar mode (no more externally opened JMX),
Cassandra 4.0 support and a lot of other features and optimizations.

We wrote 2 blog posts that give more information on the new features in 2.0:
https://thelastpickle.com/blog/2019/12/10/cassandra-reaper-2-0-release.html
https://thelastpickle.com/blog/2019/12/18/diagnostics.html

Upgrade is recommended for people using older versions and we're happy to
get your feedback.
Documentation and downloads are available on the official Reaper website
<http://cassandra-reaper.io/>.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: execute is faster than execute_async?

2019-12-11 Thread Alexander Dejanovski
Hi,

you can check this piece of documentation from Datastax:
https://docs.datastax.com/en/developer/python-driver/3.20/api/cassandra/cluster/#cassandra.cluster.Session.execute_async

The usual way of doing this is to send a bunch of execute_async() calls,
adding the returned futures to a list. Once the list reaches the chosen
threshold (usually we send around 100 queries and wait for them to finish
before moving on to the next ones), loop through the futures and call the
result() method to block until each one completes.
It should look like this:

futures = []
for i in range(len(queries)):
    futures.append(session.execute_async(queries[i]))
    if len(futures) >= 100 or i == len(queries) - 1:
        for future in futures:
            results = future.result()  # will block until the query finishes
        futures = []  # empty the list


I haven't tested the code above, but it should give you an idea of how this
can be implemented.
Sending hundreds/thousands of queries without waiting for a result will
DDoS the cluster, so you should always implement some throttling.

Cheers,

-----
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Dec 11, 2019 at 10:42 AM Jordan West  wrote:

> I’m not very familiar with the python client unfortunately. If it helps:
> In Java, async would return futures and at the end of submitting each batch
> you would block on them by calling get.
>
> Jordan
>
> On Wed, Dec 11, 2019 at 1:37 AM lampahome  wrote:
>
>>
>>
>> On Wed, Dec 11, 2019 at 4:34 PM, Jordan West wrote:
>>
>>> Hi,
>>>
>>> Have you tried batching calls to execute_async with periodic blocking
>>> for the batch’s responses?
>>>
>>
>> Can you give me some keywords about calling execute_async batch?
>>
>> PS: I use python version.
>>
>


Medusa : a new OSS backup/restore tool for Apache Cassandra

2019-11-06 Thread Alexander Dejanovski
Hi folks,

I'm happy to announce that Spotify and TLP have been collaborating to
create and open source a new backup and restore tool for Apache Cassandra:
https://github.com/spotify/cassandra-medusa
It is released under the Apache 2.0 license.

It can perform full and differential backups, in-place restores (same
cluster) and remote restores (remote cluster), whether or not the
topologies match.
More details are in our latest blog post:
https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html

Hope you'll enjoy using it,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: about remaining data after adding a node

2019-09-05 Thread Alexander Dejanovski
Hi,

I advise against running nodetool compact on a TWCS table.
If you do not want to run cleanup and are fine with the extra load on disk
for now, you can wait for the data to expire naturally.
Expiration will delete both the data that is still owned by the nodes and
the data they no longer own, once the sstables are fully expired.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Sep 5, 2019 at 11:33 AM Eunsu Kim  wrote:

> Thank you for your response.
>
>
>
> I’m using TimeWindowCompactionStrategy.
>
>
>
> So if I don't run *nodetool compact*, will the remaining data not be
> deleted?
>
>
>
> *From: *Federico Razzoli 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Thursday, 5 September 2019 at 6:19 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: about remaining data after adding a node
>
>
>
> Hi Eunsu,
>
>
>
> Are you using DateTieredCompactionStrategy? It optimises the deletion of
> expired data from disks.
>
> If minor compactions are not solving the problem, I suggest to run
> nodetool compact.
>
>
>
> Federico
>
>
>
>
>
> On Thu, 5 Sep 2019 at 09:51, Eunsu Kim  wrote:
>
>
>
> Hi, all
>
>
>
>
>
> After adding a new node, all the data was streamed by the newly allocated
> token.
>
>
>
>
>
> Since *nodetool cleanup* has not yet been performed on existing nodes,
> the total size has increased.
>
>
>
>
>
> All data has a short ttl. In this case, will the data remaining on the
> existing node be deleted after the end of life? Or should I run *nodetool
> cleanup* to delete it?
>
>
>
>
>
> Thanks in advance.
>
>


Re: Rebuilding a node without clients hitting it

2019-08-06 Thread Alexander Dejanovski
Hi Cyril,

it will depend on the load balancing policy that is used in the client code.

If you're only accessing DC1, with the node being rebuilt living in DC2,
then you need your clients to be using the DCAwareRoundRobinPolicy to
restrict connections to DC1 and avoid all kinds of queries hitting DC2.
If clients are accessing both datacenters, and you're not using the
TokenAwarePolicy, even with LOCAL_ONE, the coordinator could pick the node
being rebuilt to process the query.

If you're not spinning up a new datacenter in an existing cluster,
rebuilding a node is not the best way to achieve this without compromising
consistency.
The node should be replaced instead, which will make it bootstrap safely
(it can replace itself, using the
"-Dcassandra.replace_address_first_boot=" flag).
Bootstrap lets the node stream the data it needs faster than repair would,
while keeping it out of read requests.
The procedure is to stop Cassandra, wipe the data, commit log and saved
caches directories, and then restart it with the JVM flag set in
cassandra-env.sh. The node will appear as joining or down while
bootstrapping (it depends on whether it replaces itself or another node; I
can't remember the specifics).
If it shows up as down, it will rely on hints to get the writes. If it
shows as joining, it will get the writes while streaming is ongoing.
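
A minimal sketch of that replacement procedure on one node (package install
paths assumed; 10.0.0.12 stands for the address of the node being replaced,
which can be the node's own address):

sudo systemctl stop cassandra

# Wipe data, commit log and saved caches so the node bootstraps from scratch.
sudo rm -rf /var/lib/cassandra/data/* \
            /var/lib/cassandra/commitlog/* \
            /var/lib/cassandra/saved_caches/*

# Set the replace flag in cassandra-env.sh (remove it again once the
# bootstrap is done).
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"' \
  | sudo tee -a /etc/cassandra/cassandra-env.sh

sudo systemctl start cassandra
nodetool netstats   # watch the streaming progress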

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Tue, Aug 6, 2019 at 12:03 PM Cyril Scetbon  wrote:

> Can you elaborate on that ? We use GPFS
> without cassandra-topology.properties.
> —
> Cyril Scetbon
>
> On Aug 5, 2019, at 11:23 PM, Jeff Jirsa  wrote:
>
> some snitch trickery (setting the badness for the rebuilding host) via jmx
>
>
>


Re: What really happened during repair?

2019-08-04 Thread Alexander Dejanovski
Hi Jeff,

Anticompaction only runs before repair in the upcoming 4.0.
In all other versions of Cassandra, it runs at the end of repair sessions.
My understanding from other messages Martin sent to the ML was that he was
already running full repair, not incremental, which before 4.0 also
performs anticompaction (unless you use subrange repair).

Cheers,



On Sun, Aug 4, 2019 at 02:29, Jeff Jirsa wrote:

>
> > On Aug 3, 2019, at 5:03 PM, Martin Xue  wrote:
> >
> > Hi Cassandra community,
> >
> > I am using Cassandra 3.0.14, 1 cluster, node a,b,c in DC1, node d,e,f in
> DC2.
> >
> > Keyspace_m is 1TB
> >
> > When I run repair -pr a full keyspace_m on node a, what I noticed are:
> > 1. Repair process is running on node a
> > 2. Anti compaction after repair are running on other nodes at least node
> b,d,e,f
> >
> > I want to know
> > 1. why there are anti compactions running after repair?
>
>
> They should run before repair - they split data you’re going to repair
> from data you’re not going to repair
>
> If they’re running after, either there’s another repair command on
> adjacent nodes or you’re repairing multiple key spaces and lost track
>
> > 2. Why it needs to run on other nodes? (I only run primary range repair
> on node a)
>
> Every host involved in the repair will anticompact to split data in the
> range you’re repairing from other data. That means RF number of hosts  will
> run anticompaction for each range you repair
> > 3. What's the purpose of anti compaction after repair?
>
> Answered above , but reminder it’s before
>
> > 4. Can I disable the anti compaction? If so any damage will cause? (It
> takes more than 2 days to run on 1TB keyspace_m, and filled up disk
> quickly, too time and resources consuming)
>
>
> You can run full repair instead of incremental by passing -full
>
> But the cost of anticompaction should go down after the first successful
> incremental repair
>
> >
> > Any suggestions would be appreciated.
> >
> > Thanks
> > Regards
> > Martin
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Repair failed and crash the node, how to bring it back?

2019-07-31 Thread Alexander Dejanovski
Hi Martin,

apparently this is the bug you've been hit by on hints:
https://issues.apache.org/jira/browse/CASSANDRA-14080
It was fixed in 3.0.17.

You didn't provide the logs from Cassandra at the time of the crash, only
the output of nodetool, so it's hard to say what caused it. You may have
been hit by this bug: https://issues.apache.org/jira/browse/CASSANDRA-14096
This is unlikely to happen with Reaper (as mentioned in the description of
the ticket) since it will generate smaller Merkle trees, as subrange repair
covers fewer partitions in each repair session.

So the advice is: upgrade to 3.0.19 (or even 3.11.4 IMHO, as 3.0 offers
lower performance than 3.11) and use Reaper <http://cassandra-reaper.io/> to
handle/schedule repairs.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Aug 1, 2019 at 12:05 AM Martin Xue  wrote:

> Hi Alex,
>
> Thanks for your reply. The disk space was around 80%. The crash happened
> during repair, primary range full repair on 1TB keyspace.
>
> Would that crash again?
>
> Thanks
> Regards
> Martin
>
> On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski, <
> a...@thelastpickle.com> wrote:
>
>> It looks like you have a corrupted hint file.
>> Did the node run out of disk space while repair was running?
>>
>> You might want to move the hint files off their current directory and try
>> to restart the node again.
>> Since you'll have lost mutations then, you'll need... to run repair
>> ¯\_(ツ)_/¯
>>
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> On Wed, Jul 31, 2019 at 3:51 PM Martin Xue  wrote:
>>
>>> Hi,
>>>
>>> I am running repair on production, started with one of 6 nodes in the
>>> cluster (3 nodes in each of two DC). Cassandra version 3.0.14.
>>>
>>> running: repair -pr --full keyspace on node 1, 1TB data, takes two days,
>>> and crash,
>>>
>>> error shows:
>>> 3202]] finished (progress: 3%)
>>> Exception occurred during clean-up.
>>> java.lang.reflect.UndeclaredThrowableException
>>> Cassandra has shutdown.
>>> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check
>>> server log for repair status of keyspace keyspace_masked (Subsequent
>>> keyspaces are not going to be repaired).
>>> -- StackTrace --
>>> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed.
>>> You should check server log for repair status of keyspace keyspace_masked
>>> keyspaces are not going to be repaired).
>>> at
>>> org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
>>> at
>>> org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
>>> at
>>> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
>>> at
>>> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>>> at
>>> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>>> at
>>> javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
>>> at
>>> javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
>>> at
>>> javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
>>> at
>>> javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
>>> at
>>> javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
>>> at
>>> javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
>>> at
>>> javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
>>> at
>>> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
>>> at
>>> com.sun.j

Re: Repair failed and crash the node, how to bring it back?

2019-07-31 Thread Alexander Dejanovski
It looks like you have a corrupted hint file.
Did the node run out of disk space while repair was running?

You might want to move the hint files off their current directory and try
to restart the node again.
Since you'll have lost mutations then, you'll need... to run repair
¯\_(ツ)_/¯
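
If it helps, a minimal sketch of setting the hint files aside (assuming the
default hints directory of a package install):

sudo systemctl stop cassandra

# Keep the possibly corrupted hint files around instead of deleting them.
sudo mv /var/lib/cassandra/hints /var/lib/cassandra/hints.bak
sudo install -d -o cassandra -g cassandra /var/lib/cassandra/hints

sudo systemctl start cassandra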

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Jul 31, 2019 at 3:51 PM Martin Xue  wrote:

> Hi,
>
> I am running repair on production, started with one of 6 nodes in the
> cluster (3 nodes in each of two DC). Cassandra version 3.0.14.
>
> running: repair -pr --full keyspace on node 1, 1TB data, takes two days,
> and crash,
>
> error shows:
> 3202]] finished (progress: 3%)
> Exception occurred during clean-up.
> java.lang.reflect.UndeclaredThrowableException
> Cassandra has shutdown.
> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check
> server log for repair status of keyspace keyspace_masked (Subsequent
> keyspaces are not going to be repaired).
> -- StackTrace --
> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You
> should check server log for repair status of keyspace keyspace_masked
> keyspaces are not going to be repaired).
> at
> org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
> at
> org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
> at
> javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
> at
> javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
> at
> javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
> at
> javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
> at
> javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
> at
> javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
> at
> javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
> at
> javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
> at
> javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
> system.log shows
> INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1
> Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen:
> 19043999248 -> 20219035248;
> INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool
> NameActive   Pending  Completed   Blocked  All Time
> Blocked
> INFO  [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 -
> MutationStage1915 9578177305 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> ViewMutationStage 0 0  0 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> ReadStage10 0  219357504 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> RequestResponseStage  1 0  625174550 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> ReadRepairStage   0 02544772 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> CounterMutationStage  0 0  0 0
> 0
>
> INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 -
> MiscStage 0 0  0 0
> 0
>
> INFO  [Ser

Re: Repair / compaction for 6 nodes, 2 DC cluster

2019-07-31 Thread Alexander Dejanovski
Hi Martin,

you can stop the anticompaction by roll-restarting the nodes (I'm not sure
if "nodetool stop COMPACTION" will actually stop anticompaction, I've never
tried it).

Note that this will leave your cluster with SSTables marked as repaired and
others that are not. These two types of SSTables will never be compacted
together, which can delay reclaiming disk space over time because
overwrites and tombstones won't get merged.
If you plan to stick with nodetool, leave the anticompaction running and
hope that it's just taking a long time because it's your first repair (if
it is your first repair).

Otherwise, and this is obviously what I recommend, if you choose to use
Reaper, you can stop the running anticompactions right away and prepare for
Reaper.
Since Reaper won't trigger anticompactions, you'll have to mark your
SSTables back to the unrepaired state so that all SSTables can be compacted
with each other in the future.
To that end, you'll need to use the sstablerepairedset
<https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/tools/toolsSStabRepairedSet.html>
command line tool (it ships with Cassandra) and follow the procedure (in a
nutshell: stop Cassandra, mark the sstables as unrepaired, restart
Cassandra).

Cheers,

-----
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Jul 31, 2019 at 3:53 PM Martin Xue  wrote:

> Sorry ASAD, don't have chance, still bogged down with the production
> issue...
>
> On Wed, Jul 31, 2019 at 10:56 PM ZAIDI, ASAD A  wrote:
>
>> Did you get chance to look at tlp reaper tool i.e.
>> http://cassandra-reaper.io/
>>
>> It is pretty awesome – Thanks to TLP team.
>>
>>
>>
>>
>>
>>
>>
>> *From:* Martin Xue [mailto:martin...@gmail.com]
>> *Sent:* Wednesday, July 31, 2019 12:09 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Repair / compaction for 6 nodes, 2 DC cluster
>>
>>
>>
>> Hello,
>>
>>
>>
>> Good day. This is Martin.
>>
>>
>>
>> Can someone help me with the following query regarding Cassandra repair
>> and compaction?
>>
>>
>> Currently we have a large keyspace (keyspace_event) with 1TB of data (in
>> /var/lib/cassandra/data/keyspace_event);
>> There is a cluster with Datacenter 1 contains 3 nodes, Data center 2
>> containing 3 nodes; All together 6 nodes;
>>
>>
>> As part of maintenance, I run the repair on this keyspace with the
>> following command:
>>
>>
>> nodetool repair -pr --full keyspace_event;
>>
>>
>> now it has been run for 2 days. yes 2 days, when doing nodetool tpstats,
>> it shows there is a compaction running:
>>
>>
>> CompactionExecutor1 15783732 0
>>   0
>>
>> nodetool compactionstats shows:
>>
>>
>> pending tasks: 6
>> id   compaction type
>>   keyspace  table   completed
>> totalunit   progress
>>   249ec5f1-b225-11e9-82bd-5b36ef02cadd   Anticompaction after repair
>> keyspace_event table_event   1916937740948   2048931045927   bytes
>> 93.56%
>>
>>
>>
>>
>> Now my questions are:
>> 1. why running repair (with primary range option, -pr, as I want to limit
>> the repair node by node), triggered the compaction running on other nodes?
>> 2. when I run the repair on the second node with nodetool repair -pr
>> --full keyspace_event; will the subsequent compaction run again on all the
>> 6 nodes?
>>
>> I want to know what are the best option to run the repair (full repair)
>> as we did not run it before, especially if it can take less time (in
>> current speed it will take 2 weeks to finish all).
>>
>> I am running Cassandra 3.0.14
>>
>> Any suggestions will be appreciated.
>>
>>
>>
>> Thanks
>>
>> Regards
>>
>> Martin
>>
>>
>>
>


Re: Tombstones not getting purged

2019-06-20 Thread Alexander Dejanovski
Léo,

if a major compaction isn't a viable option, you can give Instaclustr's
sstable tools a go to target the partitions with the most tombstones:
https://github.com/instaclustr/cassandra-sstable-tools/tree/cassandra-2.2#ic-purge

It generates a report like this:

Summary:

+-+-+

| | Size|

+-+-+

| Disk|  1.9 GB |

| Reclaim | 11.7 MB |

+-+-+


Largest reclaimable partitions:

+--++-+-+

| Key  | Size   | Reclaim | Generations |

+--++-+-+

| 001.2.340862 | 3.2 kB |  3.2 kB | [534, 438, 498] |

| 001.2.946243 | 2.9 kB |  2.8 kB | [534, 434, 384] |

| 001.1.527557 | 2.8 kB |  2.7 kB | [534, 519, 394] |

| 001.2.181797 | 2.6 kB |  2.6 kB | [534, 424, 343] |

| 001.3.475853 | 2.7 kB |28 B |  [524, 462] |

| 001.0.159704 | 2.7 kB |28 B |  [440, 247] |

| 001.1.311372 | 2.6 kB |28 B |  [424, 458] |

| 001.0.756293 | 2.6 kB |28 B |  [428, 358] |

| 001.2.681009 | 2.5 kB |28 B |  [440, 241] |

| 001.2.474773 | 2.5 kB |28 B |  [524, 484] |

| 001.2.974571 | 2.5 kB |28 B |  [386, 517] |

| 001.0.143176 | 2.5 kB |28 B |  [518, 368] |

| 001.1.185198 | 2.5 kB |28 B |  [517, 386] |

| 001.3.503517 | 2.5 kB |28 B |  [426, 346] |

| 001.1.847384 | 2.5 kB |28 B |  [436, 396] |

| 001.0.949269 | 2.5 kB |28 B |  [516, 356] |

| 001.0.756763 | 2.5 kB |28 B |  [440, 249] |

| 001.3.973808 | 2.5 kB |28 B |  [517, 386] |

| 001.0.312718 | 2.4 kB |28 B |  [524, 467] |

| 001.3.632066 | 2.4 kB |28 B |  [432, 377] |

| 001.1.946590 | 2.4 kB |28 B |  [519, 389] |

| 001.1.798591 | 2.4 kB |28 B |  [434, 388] |

| 001.3.953922 | 2.4 kB |28 B |  [432, 375] |

| 001.2.585518 | 2.4 kB |28 B |  [432, 375] |

| 001.3.284942 | 2.4 kB |28 B |  [376, 432] |

+--++-+-+

Once you've identified these partitions, you can run a compaction on the
SSTables that contain them (identified using "nodetool getsstables").
Note that user-defined compactions are only available for STCS.
Also, ic-purge will perform a compaction without writing to disk (it should
look like a validation compaction), so it is rightfully reported by the
docs as an "intensive process" (though not more so than a repair).
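
For instance, something along these lines (keyspace, table and partition
key are placeholders; the SSTable generations are taken from the report
above; jmxterm is just an assumption on my side, any JMX client exposing
the CompactionManager MBean will do):

# Find the SSTables holding one of the partitions reported by ic-purge.
nodetool getsstables my_ks my_table 001.2.340862

# Compact just those SSTables through the CompactionManager MBean (STCS only).
java -jar jmxterm-uber.jar -l localhost:7199 <<'EOF'
run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction mc-534-big-Data.db,mc-438-big-Data.db,mc-498-big-Data.db
EOF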

-----
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Jun 20, 2019 at 9:17 AM Alexander Dejanovski 
wrote:

> My bad on date formatting, it should have been : %Y/%m/%d
> Otherwise the SSTables aren't ordered properly.
>
> You have 2 SSTables that claim to cover timestamps from 1940 to 2262,
> which is weird.
> Aside from that, you have big overlaps all over the SSTables, so that's
> probably why your tombstones are sticking around.
>
> Your best shot here will be a major compaction of that table, since it
> doesn't seem so big. Remember to use the --split-output flag on the
> compaction command to avoid ending up with a single SSTable after that.
>
> Cheers,
>
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> On Thu, Jun 20, 2019 at 8:13 AM Léo FERLIN SUTTON
>  wrote:
>
>> On Thu, Jun 20, 2019 at 7:37 AM Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi Leo,
>>>
>>> The overlapping SSTables are indeed the most probable cause as suggested
>>> by Jeff.
>>> Do you know if the tombstone compactions actually triggered? (did the
>>> SSTables name change?)
>>>
>>
>> Hello !
>>
>> I believe they have changed. I do not remember the sstable name but the
>> "last modified" has changed recently for these tables.
>>
>>
>>> Could you run the following command to list SSTables and provide us the
>>> output? It will display both their timestamp ranges along with the
>>> estimated droppable tombstones ratio.
>>>
>>>
>>> for f in *Data.db; do meta=$(sstablemetadata -gc_grace_seconds 259200
>>> $f); echo $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" "
>>> -f3| cut -c 1-10) '+%m/%d/%Y %H:%M:%S') $(date --date=@$(echo "$meta" |
>>> grep Minimum\ time | cut -d" "  -f3| cut -c 1-10) '+%m/%d/%Y %H:%M:%S')
>>> $(echo "$meta" | grep droppable) $(ls -lh $f); done | sort
>>>
>>
>> Here is the results :
>>
>> ```
>> 04/01/2019 22:53:12 03/06/2018 16:46:13 Estimated droppable tombstones:
>

Re: Tombstones not getting purged

2019-06-20 Thread Alexander Dejanovski
My bad on date formatting, it should have been : %Y/%m/%d
Otherwise the SSTables aren't ordered properly.

You have 2 SSTables that claim to cover timestamps from 1940 to 2262, which
is weird.
Aside from that, you have big overlaps all over the SSTables, so that's
probably why your tombstones are sticking around.

Your best shot here will be a major compaction of that table, since it
doesn't seem so big. Remember to use the --split-output flag on the
compaction command to avoid ending up with a single SSTable after that.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Jun 20, 2019 at 8:13 AM Léo FERLIN SUTTON
 wrote:

> On Thu, Jun 20, 2019 at 7:37 AM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Leo,
>>
>> The overlapping SSTables are indeed the most probable cause as suggested
>> by Jeff.
>> Do you know if the tombstone compactions actually triggered? (did the
>> SSTables name change?)
>>
>
> Hello !
>
> I believe they have changed. I do not remember the sstable name but the
> "last modified" has changed recently for these tables.
>
>
>> Could you run the following command to list SSTables and provide us the
>> output? It will display both their timestamp ranges along with the
>> estimated droppable tombstones ratio.
>>
>>
>> for f in *Data.db; do meta=$(sstablemetadata -gc_grace_seconds 259200
>> $f); echo $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" "
>> -f3| cut -c 1-10) '+%m/%d/%Y %H:%M:%S') $(date --date=@$(echo "$meta" |
>> grep Minimum\ time | cut -d" "  -f3| cut -c 1-10) '+%m/%d/%Y %H:%M:%S')
>> $(echo "$meta" | grep droppable) $(ls -lh $f); done | sort
>>
>
> Here is the results :
>
> ```
> 04/01/2019 22:53:12 03/06/2018 16:46:13 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 16G Apr 13 14:35 md-147916-big-Data.db
> 04/11/2262 23:47:16 10/09/1940 19:13:17 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 218M Jun 20 05:57 md-167948-big-Data.db
> 04/11/2262 23:47:16 10/09/1940 19:13:17 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 2.2G Jun 20 05:57 md-167942-big-Data.db
> 05/01/2019 08:03:24 03/06/2018 16:46:13 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 4.6G May 1 08:39 md-152253-big-Data.db
> 05/09/2018 06:35:03 03/06/2018 16:46:07 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 30G Apr 13 22:09 md-147948-big-Data.db
> 05/21/2019 05:28:01 03/06/2018 16:46:16 Estimated droppable tombstones:
> 0.45150604672159905 -rw-r--r-- 1 cassandra cassandra 1.1G Jun 20 05:55
> md-167943-big-Data.db
> 05/22/2019 11:54:33 03/06/2018 16:46:16 Estimated droppable tombstones:
> 0.30826566640798975 -rw-r--r-- 1 cassandra cassandra 7.6G Jun 20 04:35
> md-167913-big-Data.db
> 06/13/2019 00:02:40 03/06/2018 16:46:08 Estimated droppable tombstones:
> 0.20980847354256815 -rw-r--r-- 1 cassandra cassandra 6.9G Jun 20 04:51
> md-167917-big-Data.db
> 06/17/2019 05:56:12 06/16/2019 20:33:52 Estimated droppable tombstones:
> 0.6114260192855792 -rw-r--r-- 1 cassandra cassandra 257M Jun 20 05:29
> md-167938-big-Data.db
> 06/18/2019 11:21:55 03/06/2018 17:48:22 Estimated droppable tombstones:
> 0.18655813086540254 -rw-r--r-- 1 cassandra cassandra 2.2G Jun 20 05:52
> md-167940-big-Data.db
> 06/19/2019 16:53:04 06/18/2019 11:22:04 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 425M Jun 19 17:08 md-167782-big-Data.db
> 06/20/2019 04:17:22 06/19/2019 16:53:04 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 146M Jun 20 04:18 md-167921-big-Data.db
> 06/20/2019 05:50:23 06/20/2019 04:17:32 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 42M Jun 20 05:56 md-167946-big-Data.db
> 06/20/2019 05:56:03 06/20/2019 05:50:32 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 2 cassandra cassandra 4.8M Jun 20 05:56 md-167947-big-Data.db
> 07/03/2018 17:26:54 03/06/2018 16:46:07 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 27G Apr 13 17:45 md-147919-big-Data.db
> 09/09/2018 18:55:23 03/06/2018 16:46:08 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 30G Apr 13 18:57 md-147926-big-Data.db
> 11/30/2018 11:52:33 03/06/2018 16:46:08 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 14G Apr 13 13:53 md-147908-big-Data.db
> 12/20/2018 07:30:03 03/06/2018 16:46:08 Estimated droppable tombstones:
> 0.0 -rw-r--r-- 1 cassandra cassandra 9.3G Apr 13 13:28 md-147906-big-Data.db
> ```
>
> You could also check the min and max to

Re: Tombstones not getting purged

2019-06-19 Thread Alexander Dejanovski
Hi Leo,

The overlapping SSTables are indeed the most probable cause as suggested by
Jeff.
Do you know if the tombstone compactions actually triggered? (did the
SSTables name change?)

Could you run the following command to list SSTables and provide us the
output? It will display both their timestamp ranges along with the
estimated droppable tombstones ratio.


for f in *Data.db; do meta=$(sstablemetadata -gc_grace_seconds 259200 $f);
echo $(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" "  -f3|
cut -c 1-10) '+%m/%d/%Y %H:%M:%S') $(date --date=@$(echo "$meta" | grep
Minimum\ time | cut -d" "  -f3| cut -c 1-10) '+%m/%d/%Y %H:%M:%S') $(echo
"$meta" | grep droppable) $(ls -lh $f); done | sort


It will allow you to see the timestamp ranges of the SSTables. You could
also check the min and max tokens in each SSTable (I'm not sure if you get
that info from the 3.0 sstablemetadata) so that you can detect the SSTables
that overlap on token ranges with the ones that carry the tombstones, and
have earlier timestamps. This way you'll be able to trigger manual
compactions targeting those specific SSTables.
The rule for a tombstone to be purged is that there is no SSTable outside
the compaction that could possibly contain the partition with older
timestamps.

Is this a follow-up to your previous issue where you were trying to perform
a major compaction on an LCS table?


-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Jun 20, 2019 at 7:02 AM Jeff Jirsa  wrote:

> Probably overlapping sstables
>
> Which compaction strategy?
>
>
> > On Jun 19, 2019, at 9:51 PM, Léo FERLIN SUTTON
>  wrote:
> >
> > I have used the following command to check if I had droppable tombstones
> :
> > `/usr/bin/sstablemetadata --gc_grace_seconds 259200
> /var/lib/cassandra/data/stats/tablename/md-sstablename-big-Data.db`
> >
> > I checked every sstable in a loop and had 4 sstables with droppable
> tombstones :
> >
> > ```
> > Estimated droppable tombstones: 0.1558453651124074
> > Estimated droppable tombstones: 0.20980847354256815
> > Estimated droppable tombstones: 0.30826566640798975
> > Estimated droppable tombstones: 0.45150604672159905
> > ```
> >
> > I changed my compaction configuration this morning (via JMX) to force a
> tombstone compaction. These are my settings on this node :
> >
> > ```
> > {
> > "max_threshold":"32",
> > "min_threshold":"4",
> > "unchecked_tombstone_compaction":"true",
> > "tombstone_threshold":"0.1",
> > "class":"org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy"
> > }
> > ```
> > The threshold is lower than the amount of tombstones in these sstables
> and I expected the setting `unchecked_tombstone_compaction=True` would
> force cassandra to run a "Tombstone Compaction", yet about 24h later all
> the tombstones are still there.
> >
> > ## About the cluster :
> >
> > The compaction backlog is clear and here are our cassandra settings :
> >
> > Cassandra 3.0.18
> > concurrent_compactors: 4
> > compaction_throughput_mb_per_sec: 150
> > sstable_preemptive_open_interval_in_mb: 50
> > memtable_flush_writers: 4
> >
> >
> > Any idea what I might be missing ?
> >
> > Regards,
> >
> > Leo
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Speed up compaction

2019-06-13 Thread Alexander Dejanovski
Hi Léo,

Major compactions in LCS (and minor ones as well) are indeed very slow, and
I'm afraid there's not much you can do to speed things up. There are lots
of synchronized sections in the LCS code and it has to do a lot of
comparisons between sstables to make sure a partition won't end up in two
sstables of the same level.
A major compaction will be single-threaded for obvious reasons, and while
it is happening the newly flushed SSTables will pile up in L0, since I
don't see how Cassandra could achieve the "one sstable per partition per
level except L0" guarantee otherwise.

At this point, your best chance might be to switch the table to STCS, run a
major compaction using the "-s" flag (split output, which will create one
SSTable per size tier instead of a big fat one) and then switch back to
LCS, before or after your migration (whatever works best for you). If you
go down that path, I'd also recommend trying it out on one node first,
using JMX to alter the compaction strategy, running the major compaction
with nodetool, and seeing if it's indeed faster than the LCS major
compaction. Then proceed node by node using JMX (wait for the major
compaction to go through between nodes) and alter the schema only after the
last node has been switched to STCS.
You can use more "aggressive" compaction settings to limit read
fragmentation by reducing min_threshold to 3 instead of 4 (the default).
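
A rough sketch of the per-node test (keyspace/table names are placeholders,
and the MBean/attribute names below are from memory for 3.0, so verify them
with your JMX client before relying on them):

# 1. On one node only, switch the table to STCS through JMX: set the
#    CompactionParametersJson attribute of the table's MBean, e.g.
#    org.apache.cassandra.db:type=ColumnFamilies,keyspace=my_ks,columnfamily=my_table
#    to {"class":"SizeTieredCompactionStrategy","min_threshold":"3"}
#    using jconsole or any other JMX client. The schema itself stays on LCS.

# 2. Run the major compaction with split output (one SSTable per size tier).
nodetool compact -s my_ks my_table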

Note that doing all this will impact your cluster performance in ways I
cannot predict, and should be attempted only if you really need to perform
this major compaction and cannot wait for it to go through at the current
pace.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Thu, Jun 13, 2019 at 2:07 PM Léo FERLIN SUTTON
 wrote:

> On Thu, Jun 13, 2019 at 12:09 PM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Thu, Jun 13, 2019 at 11:28 AM Léo FERLIN SUTTON
>>  wrote:
>>
>>>
>>> ## Cassandra configuration :
>>> 4 concurrent_compactors
>>> Current compaction throughput: 150 MB/s
>>> Concurrent reads/write are both set to 128.
>>>
>>> I have also temporarily stopped every repair operations.
>>>
>>> Any ideas about how I can speed this up ?
>>>
>>
>> Hi,
>>
>> What is the compaction strategy used by this column family?
>>
>> Do you observe this behavior on one of the nodes only?  Have you tried to
>> cancel this compaction and see if a new one is started and makes better
>> progress?  Can you try to restart the affected node?
>>
>> Regards,
>> --
>> Alex
>>
>> I can't believe I forgot that information.
>
>  Overall we are talking about a 1.08TB table, using LCS.
>
> SSTable count: 1047
>> SSTables in each level: [15/4, 10, 103/100, 918, 0, 0, 0, 0, 0]
>
> SSTable Compression Ratio: 0.5192269874287099
>
> Number of partitions (estimate): 7282253587
>
>
> We have recently (about a month ago) deleted about 25% of the data in that
> table.
>
> Letting Cassandra reclaim the disk space on it's own (via regular
> compactions) was too slow for us, so we wanted to force a compaction on the
> table to reclaim the disk space faster.
>
> The speed of the compaction doesn't seem out of the ordinary for the
> cluster, only before we haven't had such a big compaction and the speed
> alarmed us.
> We never have a big compaction backlog, most of the time less than 5
> pending tasks (per node)
>
> Finally, we are running Cassandra 3.0.18 and plan to upgrade to 3.11 as
> soon as our compactions are over.
>
> Regards,
>
> Leo
>


Re: TWCS and tombstone purging

2019-03-18 Thread Alexander Dejanovski
Hi Nick,

the strategy will depend on your compaction strategy and how tombstones are
generated (DELETE statements or TTLs), and also your version of Cassandra.

If you're working with TTLs, your best option is definitely TWCS with the
unsafe_aggressive_sstable_expiration flag that was introduced by
CASSANDRA-13418 <https://issues.apache.org/jira/browse/CASSANDRA-13418>.
It'll delete all fully expired SSTables even when there are timestamp
overlaps with other SSTables. If you have different TTLs, you can also
enable *unchecked_tombstone_compaction* to trigger single-SSTable
compactions more often (and adjust the tombstone_threshold to your
particular workload). You can lower gc_grace_seconds to 3 hours (no less,
otherwise you'll reduce the hint window) in order to avoid keeping
tombstones on disk.
That's the easy case.
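
Put together, the TTL case could look something like the following (table
name, window size and threshold are only examples, and depending on your
exact version the unsafe expiration option may also need to be explicitly
allowed through a JVM system property at startup):

cqlsh -e "
ALTER TABLE my_ks.my_ttl_table
  WITH compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1',
    'unsafe_aggressive_sstable_expiration': 'true',
    'unchecked_tombstone_compaction': 'true',
    'tombstone_threshold': '0.2'
  }
  AND gc_grace_seconds = 10800;"   # 10800s = 3 hours, the default hint window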

Then, if you're generating tombstones from DELETE statements, it can be
trickier, as you'll need the tombstones to be compacted with the data they
shadow in order to get a chance to evict it eventually. You also cannot
reduce gc_grace_seconds below your repair cycle, as it would create a
possibility of reviving deleted data (zombie data).
LCS doesn't get along very well with tombstones, as they can get "stuck" in
higher levels while the data they shadow is stored in the lower levels.
LCS major compactions are also fairly long to run (and single-threaded).
TWCS doesn't apply to data that isn't TTLed (your tombstones will possibly
be stored in a different time window than the data they shadow).
That leaves us with STCS. If you want to be as aggressive as possible there
and purge your deletes ASAP, you'll need to (see the sketch after this
list):

   - run repair very often to secure your deletions
   - reduce gc_grace_seconds to a value that's slightly higher than your
   repair cycle
   - run major compactions with the -s flag, in order to avoid creating a
   single big file, and create one file per size tier instead. The best idea
   that I can think of here is to trigger a major compaction right after a
   successful repair.
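
A minimal sketch of that loop (keyspace/table are placeholders, and the
4-day gc_grace_seconds assumes a repair cycle of roughly 3 days):

# Keep gc_grace_seconds slightly above the repair cycle (4 days here).
cqlsh -e "ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 345600;"

# Repair often to secure the deletions...
nodetool repair -full my_ks my_table

# ...and run a split-output major compaction right after each successful repair.
nodetool compact -s my_ks my_table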

We have a few posts on our blog <http://thelastpickle.com/blog/> that cover
the tombstones and compaction strategies topic (search for "tombstone" on
that page), notably this one:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Cheers,



On Sat, Mar 16, 2019 at 1:04 AM Nick Hatfield 
wrote:

> Hey guys,
>
>
>
> Can someone give me some idea or link some good material for determining a
> good / aggressive tombstone strategy? I want to make sure my tombstones are
> getting purged as soon as possible to reclaim disk.
>
>
>
> Thanks
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Changing existing Cassandra cluster from single rack configuration to multi racks configuration

2019-03-12 Thread Alexander Dejanovski
Hi Justin,

I'm not sure I follow your reasoning. In a 6 node cluster with 3 racks (2
nodes per rack) and RF 3, if a node goes down you'll still have one node in
each of the other racks to serve the requests. Nodes within the same racks
aren't replicas for the same tokens (as long as the number of racks is
greater or equal to the RF).

Regarding the other question with the decommission/rebootstrap procedure,
unbalances are indeed to be expected, and I'd favor the DC switch
technique, but it may not be an option.

Cheers,


Le mar. 12 mars 2019 à 18:28, Justin Sanciangco
 a écrit :

> I would recommend that you do not go into a 3 rack single dc
> implementation with only 6 nodes. If a node goes down in this situation,
> the node that is paired with the node that is downed will have to service
> all of the load instead of being evenly distributed throughout the cluster.
> While its conceptually nice to have 3 rack implementation, it does have
> some negative implications when not at a proper node count.
>
>
>
> What features are you trying to make use of with going with multirack?
>
>
>
> - Justin Sanciangco
>
>
>
>
>
> *From: *Laxmikant Upadhyay 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, March 11, 2019 at 10:52 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Changing existing Cassandra cluster from single rack
> configuration to multi racks configuration
>
>
>
> Hi Alex,
>
>
>
> Regarding your point below, the admin needs to take care of the temporary
> uneven distribution of data until the entire process is done:
>
>
>
> "If you can't, then I guess you can for each node (one at a time),
> decommission it, wipe it clean and re-bootstrap it after setting the
> appropriate rack."
>
>
>
> I believe that while doing so in the existing single rack cluster, the first
> new node joining with a different rack (rac2) will take 100% of the load, so
> its disk usage will be proportionally very high in comparison to other nodes
> in rac1.
>
> So until both racks have an equal number of nodes and we run nodetool
> cleanup, the data will not be equally distributed.
>
>
>
>
>
> On Wed, Mar 6, 2019 at 5:50 PM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> Hi Manish,
>
>
>
> the best way, if you have the opportunity to easily add new
> hardware/instances, is to create a new DC with racks and switch traffic to
> the new DC when it's ready (then remove the old one). My co-worker Alain
> just wrote a very handy blog post on that technique :
> http://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>
>
>
> If you can't, then I guess you can for each node (one at a time),
> decommission it, wipe it clean and re-bootstrap it after setting the
> appropriate rack.
>
> Also, take into account that your keyspaces must use the
> NetworkTopologyStrategy so that racks can be taken into account. Change the
> strategy prior to adding the new nodes if you're currently using
> SimpleStrategy.
>
>
>
> You cannot (and shouldn't) try to change the rack on an existing node (the
> GossipingPropertyFileSnitch won't allow it).
>
>
>
> Cheers,
>
>
>
> On Wed, Mar 6, 2019 at 12:15 PM manish khandelwal <
> manishkhandelwa...@gmail.com> wrote:
>
> We have a 6 node Cassandra cluster in which all the nodes  are in same
> rack in a dc. We want to take advantage of "multi rack" cluster (example:
> parallel upgrade on all the nodes in same rack without downtime). I would
> like to know what is the recommended process to change an existing cluster
> with single racks configuration to multi rack configuration.
>
>
>
> I want to introduce 3 racks with 2 nodes in each rack.
>
>
>
> Regards
>
> Manish
>
>
>
> --
>
> -
>
> Alexander Dejanovski
>
> France
>
> @alexanderdeja
>
>
>
> Consultant
>
> Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
>
>
>
> --
>
>
>
> regards,
>
> Laxmikant Upadhyay
>
>
>


Re: Changing existing Cassandra cluster from single rack configuration to multi racks configuration

2019-03-06 Thread Alexander Dejanovski
Hi Manish,

the best way, if you have the opportunity to easily add new
hardware/instances, is to create a new DC with racks and switch traffic to
the new DC when it's ready (then remove the old one). My co-worker Alain
just wrote a very handy blog post on that technique :
http://thelastpickle.com/blog/2019/02/26/data-center-switch.html

If you can't, then I guess you can for each node (one at a time),
decommission it, wipe it clean and re-bootstrap it after setting the
appropriate rack.
Also, take into account that your keyspaces must use the
NetworkTopologyStrategy so that racks can be taken into account. Change the
strategy prior to adding the new nodes if you're currently using
SimpleStrategy.
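
As an illustration (keyspace name, DC name and replication factor below are
placeholders; use the DC name reported by "nodetool status" and keep your
current RF):

-- hypothetical names; run this before the new rack-aware nodes are added
ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};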

You cannot (and shouldn't) try to change the rack on an existing node (the
GossipingPropertyFileSnitch won't allow it).

Cheers,

On Wed, Mar 6, 2019 at 12:15 PM manish khandelwal <
manishkhandelwa...@gmail.com> wrote:

> We have a 6 node Cassandra cluster in which all the nodes  are in same
> rack in a dc. We want to take advantage of "multi rack" cluster (example:
> parallel upgrade on all the nodes in same rack without downtime). I would
> like to know what is the recommended process to change an existing cluster
> with single racks configuration to multi rack configuration.
>
>
> I want to introduce 3 racks with 2 nodes in each rack.
>
>
> Regards
> Manish
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: [EXTERNAL] Re: Question on changing node IP address

2019-02-27 Thread Alexander Dejanovski
It has to be balanced with the dangers related to the PropertyFileSnitch.
I've seen such incidents happen twice in the last few months in different
places and both times recovery was difficult and hazardous.

I still strongly recommend against it.

On Wed, Feb 27, 2019 at 3:11 PM Durity, Sean R 
wrote:

> We use the PropertyFileSnitch precisely because it is the same on every
> node. If each node has to have a different file (for GPFS) – deployment is
> more complicated. (And for any automated configuration you would have a
> list of hosts and DC/rack information to compile anyway)
>
>
>
> I do put UNKNOWN as the default DC so that any missed node easily appears
> in its own unused DC.
>
>
>
>
>
> Sean Durity
>
>
>
> *From:* Alexander Dejanovski 
> *Sent:* Wednesday, February 27, 2019 4:43 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: Question on changing node IP address
>
>
>
> This snitch is easy to misconfigure. It allows some nodes to have a
> different view of the cluster if they are configured differently, which can
> result in data loss (or at least data that is very hard to recover).
>
> Also it has a nasty feature that allows to set a default DC/Rack. If one
> node isn't properly declared in all the files throughout the cluster, it
> will be seen as part of that "default" DC and then again, it's hard to
> recover.
>
> Be aware that while the GossipingPropertyFileSnitch will not allow
> changing rack or DC for a node that already bootstrapped, the
> PropertyFileSnitch will allow to change it without any notice. So a little
> misconfiguration could merge all nodes from DC1 into DC2, abruptly changing
> token ownership (and it could very well be the case that DC1 thinks it's part of
> DC2 but DC2 still thinks DC1 is DC1).
>
> So again, I think this snitch is dangerous and shouldn't be used. The
> GossipingPropertyFileSnitch is much more secure and easy to use.
>
>
>
> Cheers,
>
>
>
>
>
> On Wed, Feb 27, 2019 at 10:13 AM shalom sagges 
> wrote:
>
> If you're using the PropertyFileSnitch, well... you shouldn't as it's a
> rather dangerous and tedious snitch to use
>
>
>
> I inherited Cassandra clusters that use the PropertyFileSnitch. It's been
> working fine, but you've kinda scared me :-)
>
> Why is it dangerous to use?
>
> If I decide to change the snitch, is it seamless or is there a specific
> procedure one must follow?
>
>
>
> Thanks!
>
>
>
>
>
> On Wed, Feb 27, 2019 at 10:08 AM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> I confirm what Oleksandr said.
>
> Just stop Cassandra, change the IP, and restart Cassandra.
>
> If you're using the GossipingPropertyFileSnitch, the node will redeclare
> its new IP through Gossip and that's it.
>
> If you're using the PropertyFileSnitch, well... you shouldn't as it's a
> rather dangerous and tedious snitch to use. But if you are, it'll require
> to change the file containing all the IP addresses across the cluster.
>
>
>
> I've been changing IPs on a whole cluster back in 2.1 this way and it went
> through seamlessly.
>
>
>
> Cheers,
>
>
>
> On Wed, Feb 27, 2019 at 8:54 AM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
> On Wed, Feb 27, 2019 at 4:15 AM wxn...@zjqunshuo.com 
> wrote:
>
> >After restart with the new address the server will notice it and log a
> warning, but it will keep token ownership as long as it keeps the old host
> id (meaning it must use the same data directory as before restart).
>
>
>
> Based on my understanding, token range is binded to host id. As long as
> host id doesn't change, everything is ok. Besides data directory, any other
> thing can lead to host id change? And how host id is caculated? For
> example, if I upgrade Cassandra binary to a new version, after restart,
> will host id change?
>
>
>
> I believe host id is calculated once the new node is initialized and never
> changes afterwards, even through major upgrades.  It is stored in system
> keyspace in data directory, and is stable across restarts.
>
>
>
> --
>
> Alex
>
>
>
> --
>
> -
>
> Alexander Dejanovski
>
> France
>
> @alexanderdeja
>
>
>
> Consultant
>
> Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
> --
>
> -
>
> Alexander Dejanovski

Re: Question on changing node IP address

2019-02-27 Thread Alexander Dejanovski
Check "nodetool info" on each node to get its rack and DC, and put that
information in *cassandra-rackdc.properties *(even if you have no rack
defined, the nodes are assigned to a default rack and you'll want to use
the same).
Then as you said, change the snitch in the yaml and restart the node.
If you want to play it safe, check "nodetool status" on other nodes in the
cluster after each restart so that you can verify it's still located where
it should be.

Once you've applied the new snitch everywhere delete or rename
*cassandra-topology.properties* to avoid having it used again if there's an
accidental rollback in the yaml.
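
As a rough per-node sketch (the dc/rack values are placeholders, copy exactly
what "nodetool info" reports for that node):

# cassandra-rackdc.properties (read by the GossipingPropertyFileSnitch)
dc=DC1
rack=RAC1

# cassandra.yaml
endpoint_snitch: GossipingPropertyFileSnitch

Then restart the node and check "nodetool status" from another node before
moving on, as described above.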

On Wed, Feb 27, 2019 at 1:32 PM shalom sagges 
wrote:

> Thanks for the info Alex!
>
> I read
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsSwitchSnitch.html
> but still have a few questions:
>
> Our clusters are comprised of 2 DCs with no rack configuration, RF=3 on
> each DC.
> In this scenario, if I wish to seamlessly change the snitch with 0
> downtime, do I need to add the cassandra-rackdc.properties file, change the
> snitch in cassandra.yaml and restart one by one?
> Will this method cause problems?
>
> Thanks!
>
>
> On Wed, Feb 27, 2019 at 12:18 PM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> You'll be fine with the SimpleSnitch (which shouldn't be used either
>> because it doesn't allow a cluster to use multiple datacenters or racks).
>> Just change the IP and upon restart the node will redeclare itself in the
>> ring. If your node is a seed node, you'll need to update your seed list
>> across the cluster.
>>
>> On Wed, Feb 27, 2019 at 10:52 AM wxn...@zjqunshuo.com <
>> wxn...@zjqunshuo.com> wrote:
>>
>>> I'm using SimpleSnitch. I have only one DC. Is there any problem to
>>> follow the below procedure?
>>>
>>> -Simon
>>>
>>> *From:* Alexander Dejanovski 
>>> *Date:* 2019-02-27 16:07
>>> *To:* user 
>>> *Subject:* Re: Question on changing node IP address
>>>
>>> I confirm what Oleksandr said.
>>> Just stop Cassandra, change the IP, and restart Cassandra.
>>> If you're using the GossipingPropertyFileSnitch, the node will redeclare
>>> its new IP through Gossip and that's it.
>>> If you're using the PropertyFileSnitch, well... you shouldn't as it's a
>>> rather dangerous and tedious snitch to use. But if you are, it'll require
>>> to change the file containing all the IP addresses across the cluster.
>>>
>>> I've been changing IPs on a whole cluster back in 2.1 this way and it
>>> went through seamlessly.
>>>
>>> Cheers,
>>>
>>> On Wed, Feb 27, 2019 at 8:54 AM Oleksandr Shulgin <
>>> oleksandr.shul...@zalando.de> wrote:
>>>
>>>> On Wed, Feb 27, 2019 at 4:15 AM wxn...@zjqunshuo.com <
>>>> wxn...@zjqunshuo.com> wrote:
>>>>
>>>>> >After restart with the new address the server will notice it and log
>>>>> a warning, but it will keep token ownership as long as it keeps the old
>>>>> host id (meaning it must use the same data directory as before restart).
>>>>>
>>>>> Based on my understanding, token range is binded to host id. As long
>>>>> as host id doesn't change, everything is ok. Besides data directory, any
>>>>> other thing can lead to host id change? And how host id is caculated? For
>>>>> example, if I upgrade Cassandra binary to a new version, after restart,
>>>>> will host id change?
>>>>>
>>>>
>>>> I believe host id is calculated once the new node is initialized and
>>>> never changes afterwards, even through major upgrades.  It is stored in
>>>> system keyspace in data directory, and is stable across restarts.
>>>>
>>>> --
>>>> Alex
>>>>
>>>> --
>>> -
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>> --
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Question on changing node IP address

2019-02-27 Thread Alexander Dejanovski
You'll be fine with the SimpleSnitch (which shouldn't be used either
because it doesn't allow a cluster to use multiple datacenters or racks).
Just change the IP and upon restart the node will redeclare itself in the
ring. If your node is a seed node, you'll need to update your seed list
across the cluster.

On Wed, Feb 27, 2019 at 10:52 AM wxn...@zjqunshuo.com 
wrote:

> I'm using SimpleSnitch. I have only one DC. Is there any problem to follow
> the below procedure?
>
> -Simon
>
> *From:* Alexander Dejanovski 
> *Date:* 2019-02-27 16:07
> *To:* user 
> *Subject:* Re: Question on changing node IP address
>
> I confirm what Oleksandr said.
> Just stop Cassandra, change the IP, and restart Cassandra.
> If you're using the GossipingPropertyFileSnitch, the node will redeclare
> its new IP through Gossip and that's it.
> If you're using the PropertyFileSnitch, well... you shouldn't as it's a
> rather dangerous and tedious snitch to use. But if you are, it'll require
> to change the file containing all the IP addresses across the cluster.
>
> I've been changing IPs on a whole cluster back in 2.1 this way and it went
> through seamlessly.
>
> Cheers,
>
> On Wed, Feb 27, 2019 at 8:54 AM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Wed, Feb 27, 2019 at 4:15 AM wxn...@zjqunshuo.com <
>> wxn...@zjqunshuo.com> wrote:
>>
>>> >After restart with the new address the server will notice it and log a
>>> warning, but it will keep token ownership as long as it keeps the old host
>>> id (meaning it must use the same data directory as before restart).
>>>
>>> Based on my understanding, token range is binded to host id. As long as
>>> host id doesn't change, everything is ok. Besides data directory, any other
>>> thing can lead to host id change? And how host id is caculated? For
>>> example, if I upgrade Cassandra binary to a new version, after restart,
>>> will host id change?
>>>
>>
>> I believe host id is calculated once the new node is initialized and
>> never changes afterwards, even through major upgrades.  It is stored in
>> system keyspace in data directory, and is stable across restarts.
>>
>> --
>> Alex
>>
>> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Question on changing node IP address

2019-02-27 Thread Alexander Dejanovski
This snitch is easy to misconfigure. It allows some nodes to have a
different view of the cluster if they are configured differently, which can
result in data loss (or at least data that is very hard to recover).
Also it has a nasty feature that allows to set a default DC/Rack. If one
node isn't properly declared in all the files throughout the cluster, it
will be seen as part of that "default" DC and then again, it's hard to
recover.
Be aware that while the GossipingPropertyFileSnitch will not allow changing
rack or DC for a node that already bootstrapped, the PropertyFileSnitch
will allow to change it without any notice. So a little misconfiguration
could merge all nodes from DC1 into DC2, abruptly changing token ownership
(and it could very well be the case that DC1 thinks it's part of DC2 but DC2
still thinks DC1 is DC1).
So again, I think this snitch is dangerous and shouldn't be used. The
GossipingPropertyFileSnitch is much more secure and easy to use.

Cheers,


On Wed, Feb 27, 2019 at 10:13 AM shalom sagges 
wrote:

> If you're using the PropertyFileSnitch, well... you shouldn't as it's a
> rather dangerous and tedious snitch to use
>
> I inherited Cassandra clusters that use the PropertyFileSnitch. It's been
> working fine, but you've kinda scared me :-)
> Why is it dangerous to use?
> If I decide to change the snitch, is it seamless or is there a specific
> procedure one must follow?
>
> Thanks!
>
>
> On Wed, Feb 27, 2019 at 10:08 AM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> I confirm what Oleksandr said.
>> Just stop Cassandra, change the IP, and restart Cassandra.
>> If you're using the GossipingPropertyFileSnitch, the node will redeclare
>> its new IP through Gossip and that's it.
>> If you're using the PropertyFileSnitch, well... you shouldn't as it's a
>> rather dangerous and tedious snitch to use. But if you are, it'll require
>> to change the file containing all the IP addresses across the cluster.
>>
>> I've been changing IPs on a whole cluster back in 2.1 this way and it
>> went through seamlessly.
>>
>> Cheers,
>>
>> On Wed, Feb 27, 2019 at 8:54 AM Oleksandr Shulgin <
>> oleksandr.shul...@zalando.de> wrote:
>>
>>> On Wed, Feb 27, 2019 at 4:15 AM wxn...@zjqunshuo.com <
>>> wxn...@zjqunshuo.com> wrote:
>>>
>>>> >After restart with the new address the server will notice it and log a
>>>> warning, but it will keep token ownership as long as it keeps the old host
>>>> id (meaning it must use the same data directory as before restart).
>>>>
>>>> Based on my understanding, token range is binded to host id. As long as
>>>> host id doesn't change, everything is ok. Besides data directory, any other
>>>> thing can lead to host id change? And how host id is caculated? For
>>>> example, if I upgrade Cassandra binary to a new version, after restart,
>>>> will host id change?
>>>>
>>>
>>> I believe host id is calculated once the new node is initialized and
>>> never changes afterwards, even through major upgrades.  It is stored in
>>> system keyspace in data directory, and is stable across restarts.
>>>
>>> --
>>> Alex
>>>
>>> --
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Question on changing node IP address

2019-02-27 Thread Alexander Dejanovski
I confirm what Oleksandr said.
Just stop Cassandra, change the IP, and restart Cassandra.
If you're using the GossipingPropertyFileSnitch, the node will redeclare
its new IP through Gossip and that's it.
If you're using the PropertyFileSnitch, well... you shouldn't as it's a
rather dangerous and tedious snitch to use. But if you are, it'll require
to change the file containing all the IP addresses across the cluster.

I've been changing IPs on a whole cluster back in 2.1 this way and it went
through seamlessly.

Cheers,

On Wed, Feb 27, 2019 at 8:54 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Wed, Feb 27, 2019 at 4:15 AM wxn...@zjqunshuo.com 
> wrote:
>
>> >After restart with the new address the server will notice it and log a
>> warning, but it will keep token ownership as long as it keeps the old host
>> id (meaning it must use the same data directory as before restart).
>>
>> Based on my understanding, token range is binded to host id. As long as
>> host id doesn't change, everything is ok. Besides data directory, any other
>> thing can lead to host id change? And how host id is caculated? For
>> example, if I upgrade Cassandra binary to a new version, after restart,
>> will host id change?
>>
>
> I believe host id is calculated once the new node is initialized and never
> changes afterwards, even through major upgrades.  It is stored in system
> keyspace in data directory, and is stable across restarts.
>
> --
> Alex
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: [EXTERNAL] Availability issues for write/update/read workloads (up to 100s downtime) in case of a Cassandra node failure

2018-11-16 Thread Alexander Dejanovski
ormationssystemen (OMI)
>
> Albert-Einstein-Allee 43 
>
>
>
> 89081 Ulm 
>
> Phone: +49 (0)731 50-28 799 <+49%20731%205028799>
>
>
> --
>
>
> --
> M.Sc. Daniel Seybold
>
> Universität Ulm
> Institut Organisation und Management
> von Informationssystemen (OMI)
> Albert-Einstein-Allee 43
> 89081 Ulm 
> Phone: +49 (0)731 50-28 799 <+49%20731%205028799>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Compacting more than the actual used space

2018-11-05 Thread Alexander Dejanovski
You can check cfstats to see what the compression ratio is.
It's totally possible to have the values you're reporting as a compression
ratio of 0.2 is quite common depending on the data you're storing
(compressed size is then 20% of the original data).
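
For instance (keyspace and table names are placeholders; on recent versions
cfstats is called tablestats):

nodetool tablestats my_ks.my_table | grep -i "compression ratio"
# e.g. "SSTable Compression Ratio: 0.52" means on-disk data is ~52% of the uncompressed size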

Compaction throughput changes are taken into account for running
compactions starting with Cassandra 2.2 if I'm correct. Your compaction
could be bound by cpu, not I/O in that case.

Cheers

Le lun. 5 nov. 2018 à 20:41, Pedro Gordo  a
écrit :

> Hi
>
> We have an ongoing compaction for roughly 2.5 TB, but "nodetool status"
> reports a load of 1.09 TB. Even if we take into account that the load
> presented by "nodetool status" is the compressed size, I very much doubt
> that compression would work to reduce from 2.5 TB to 1.09 TB.
> We can also take into account that, even if this is the biggest table,
> there are other tables in the system, so the 1.09 TB reported is not just
> for the table being compacted.
>
> What could lead to results like this? We have 4 attached volumes for data
> directories. Could this be a likely cause for such discrepancy?
>
> Bonus question: changing the compaction throughput to 0 (removing the
> throttling), had no impacts in the current compaction. Do new compaction
> throughput values only come into effect when a new compaction kicks in?
>
> Cheers
>
> Pedro Gordo
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Cassandra | Cross Data Centre Replication Status

2018-10-31 Thread Alexander Dejanovski
Akshay,

avoid running repair in that case, it'll take way longer than rebuild and
it will stream data back to your original DC, even between nodes in that
original DC, which is not what you're running after, and could lead to all
sorts of troubles.

Run "nodetool rebuild " as recommended by Jon and Surbhi. All
the data in the original DC will be streamed out to the new one, including
the data that was already written since you altered your keyspace
replication settings (so 2 weeks of data). It will then use some extra disk
space until compaction catches up.
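
A minimal sketch, assuming the source DC is the one named AWS_Sgp in your
keyspace definition (the throughput values below are only examples, and the
inter-DC setting may not exist on older versions):

# on each node of the new DC
nodetool rebuild -- AWS_Sgp

# optionally raise the knobs mentioned earlier in the thread while the rebuild runs
nodetool setstreamthroughput 400
nodetool setinterdcstreamthroughput 400
nodetool setcompactionthroughput 64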

Cheers,


On Wed, Oct 31, 2018 at 2:45 PM Kiran mk  wrote:

> Run the repair with -pr option on each node which will repair only the
> partition range.
>
> nodetool repair -pr
> On Wed, Oct 31, 2018 at 7:04 PM Surbhi Gupta 
> wrote:
> >
> > Nodetool repair will take way more time than nodetool rebuild.
> > How much data u have in your original data center?
> > Repair should be run to make the data consistent in case of node down
> more than hintedhandoff period and dropped mutations.
> > But as a thumb rule ,generally we run repair using opscenter (if using
> Datastax) most of the times.
> >
> > So in your case run “nodetool rebuild ” on all the
> nodes in new data center.
> > For making the rebuild process fast, increase three parameters,
> compaction throughput , stream throughput and interdcstream  throughput.
> >
> > Thanks
> > Surbhi
> > On Tue, Oct 30, 2018 at 11:29 PM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
> >>
> >> Hi Jonathan,
> >>
> >> That makes sense. Thank you for the explanation.
> >>
> >> Another quick question, as the cluster is still operative and the data
> for the past 2 weeks (since updating replication factor) is present in both
> the data centres, should I run "nodetool rebuild" or "nodetool repair"?
> >>
> >> I read that nodetool rebuild is faster and is useful till the new data
> centre is empty and no partition keys are present. So when is the good time
> to use either of the commands and what impact can it have on the data
> centre operations?
> >>
> >> Thanks and Regards
> >>
> >> Akshay Bhardwaj
> >> +91-97111-33849 <+91%2097111%2033849>
> >>
> >>
> >> On Wed, Oct 31, 2018 at 2:34 AM Jonathan Haddad 
> wrote:
> >>>
> >>> You need to run "nodetool rebuild -- " on each node
> in the new DC to get the old data to replicate.  It doesn't do it
> automatically because Cassandra has no way of knowing if you're done adding
> nodes and if it were to migrate automatically, it could cause a lot of
> problems. Imagine streaming 100 nodes data to 3 nodes in the new DC, not
> fun.
> >>>
> >>> On Tue, Oct 30, 2018 at 1:59 PM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
> >>>>
> >>>> Hi Experts,
> >>>>
> >>>> I previously had 1 Cassandra data centre in AWS Singapore region with
> 5 nodes, with my keyspace's replication factor as 3 in Network topology.
> >>>>
> >>>> After this cluster has been running smoothly for 4 months (500 GB of
> data on each node's disk), I added 2nd data centre in AWS Mumbai region
> with yet again 5 nodes in Network topology.
> >>>>
> >>>> After updating my keyspace's replication factor to
> {"AWS_Sgp":3,"AWS_Mum":3}, my expectation was that the data present in Sgp
> region will immediately start replicating on the Mum region's nodes.
> However even after 2 weeks I do not see historical data to be replicated,
> but new data being written on Sgp region is present in Mum region as well.
> >>>>
> >>>> Any help or suggestions to debug this issue will be highly
> appreciated.
> >>>>
> >>>> Regards
> >>>> Akshay Bhardwaj
> >>>> +91-97111-33849 <+91%2097111%2033849>
> >>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Jon Haddad
> >>> http://www.rustyrazorblade.com
> >>> twitter: rustyrazorblade
> >>>
> >>>
> >>
> >>
>
>
> --
> Best Regards,
> Kiran.M.K.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: rolling version upgrade, upgradesstables, and vulnerability window

2018-10-30 Thread Alexander Dejanovski
Yes, as the new version can read both the old and the new sstables format.

Restrictions only apply when the cluster is in mixed versions.

On Tue, Oct 30, 2018 at 4:37 PM Carl Mueller
 wrote:

> But the topology change restrictions are only in place while there are
> heterogenous versions in the cluster? All the nodes at the upgraded version
> with "degraded" sstables does NOT preclude topology changes or node
> replacement/addition?
>
>
> On Tue, Oct 30, 2018 at 10:33 AM Jeff Jirsa  wrote:
>
>> Wait for 3.11.4 to be cut
>>
>> I also vote for doing all the binary bounces and upgradesstables after
>> the fact, largely because normal writes/compactions are going to naturally
>> start upgrading sstables anyway, and there are some hard restrictions on
>> mixed mode (e.g. schema changes won’t cross version) that can be far more
>> impactful.
>>
>>
>>
>> --
>> Jeff Jirsa
>>
>>
>> > On Oct 30, 2018, at 8:21 AM, Carl Mueller 
>> > 
>> wrote:
>> >
>> > We are about to finally embark on some version upgrades for lots of
>> clusters, 2.1.x and 2.2.x targetting eventually 3.11.x
>> >
>> > I have seen recipes that do the full binary upgrade + upgrade sstables
>> for 1 node before moving forward, while I've seen a 2016 vote by Jon Haddad
>> (a TLP guy) that backs doing the binary version upgrades through the
>> cluster on a rolling basis, then doing the upgradesstables on a rolling
>> basis.
>> >
>> > Under what cluster conditions are streaming/node replacement precluded,
>> that is we are vulnerable to a cloud provided dumping one of our nodes
>> under us or hardware failure? We ain't apple, but we do have 30+ node
>> datacenters and 80-100 node clusters.
>> >
>> > Is the node replacement and streaming only disabled while there are
>> heterogenous cassandra versions, or until all the sstables have been
>> upgraded in the cluster?
>> >
>> > My instincts tell me the best thing to do is to get all the cassandra
>> nodes to the same version without the upgradesstables step through the
>> cluster, and then roll through the upgradesstables as needed, and that
>> upgradesstables is a node-local concern that doesn't impact streaming or
>> node replacement or other situations since cassandra can read old version
>> sstables and new sstables would simply be the new format.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: rolling version upgrade, upgradesstables, and vulnerability window

2018-10-30 Thread Alexander Dejanovski
Hi Carl,

the safest way is indeed (as suggested by Jon) to upgrade the whole cluster
as quickly as possible, and stop all operations that could generate streaming
until all nodes are using the target version.
That includes repair, topology changes (bootstraps, decommissions) and
rebuilds.
You should also avoid all schema changes as they are most probably going to
partially fail in mixed versions clusters.

Run a rolling upgradesstables once the whole cluster is upgraded. You can
(should?) use cstar for that operation as it'll be able to run
upgradesstables with topology awareness, leaving a quorum of replicas free
of the operation at all times.
As upgradesstables will use compaction slots, you could raise your number
of compactors to 4 at least and use "-j 2" to have two slots used by the
upgradesstables. This will leave 2 compactors available for standard
compactions.
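
A small sketch of that last step (keyspace name is a placeholder; cstar, or
any tool doing a rolling execution, would run the nodetool part node by node):

# cassandra.yaml
concurrent_compactors: 4

# once every node runs the target version
nodetool upgradesstables -j 2 my_keyspace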

Cheers,

Alex, another TLP guy ;)



On Tue, Oct 30, 2018 at 4:21 PM Carl Mueller
 wrote:

> We are about to finally embark on some version upgrades for lots of
> clusters, 2.1.x and 2.2.x targetting eventually 3.11.x
>
> I have seen recipes that do the full binary upgrade + upgrade sstables for
> 1 node before moving forward, while I've seen a 2016 vote by Jon Haddad (a
> TLP guy) that backs doing the binary version upgrades through the cluster
> on a rolling basis, then doing the upgradesstables on a rolling basis.
>
> Under what cluster conditions are streaming/node replacement precluded,
> that is we are vulnerable to a cloud provided dumping one of our nodes
> under us or hardware failure? We ain't apple, but we do have 30+ node
> datacenters and 80-100 node clusters.
>
> Is the node replacement and streaming only disabled while there are
> heterogenous cassandra versions, or until all the sstables have been
> upgraded in the cluster?
>
> My instincts tell me the best thing to do is to get all the cassandra
> nodes to the same version without the upgradesstables step through the
> cluster, and then roll through the upgradesstables as needed, and that
> upgradesstables is a node-local concern that doesn't impact streaming or
> node replacement or other situations since cassandra can read old version
> sstables and new sstables would simply be the new format.
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Best compaction strategy

2018-10-25 Thread Alexander Dejanovski
Hi Raman,

TWCS is the best compaction strategy for TTL data, even if you have
different TTLs (set the time window based on your largest TTL, so it would
be 1 day in your case).
Enable unchecked tombstone compaction to clear the data with 2 days TTL
along the way. This is done by setting :

ALTER TABLE my_table WITH compaction =
{'class':'TimeWindowCompactionStrategy',
'unchecked_tombstone_compaction':'true', ...}

If you're running 3.11.1 at least, you can turn on the
unsafe_aggressive_sstable_expiration introduced by CASSANDRA-13418
<https://issues.apache.org/jira/browse/CASSANDRA-13418>.

Cheers,

On Thu, Oct 25, 2018 at 2:59 PM raman gugnani 
wrote:

> Hi All,
>
> I have one table in which i have some data which has TTL of 2days and some
> data which has TTL of 60 days. What compaction strategy will suits the most.
>
>1. LeveledCompactionStrategy (LCS)
>2. SizeTieredCompactionStrategy (STCS)
>3. TimeWindowCompactionStrategy (TWCS)
>
>
> --
> Raman Gugnani
>
> 8588892293 <(858)%20889-2293>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Multi dc reaper

2018-10-01 Thread Alexander Dejanovski
Hi Abdul,

the thing with multi DC Cassandra clusters is usually that JMX is not
accessible on the cross DC link, which means that one Reaper in a DC cannot
reach the nodes in remote DCs directly.
That's when you need to start Reaper instances in each DC which will sync
up through the Cassandra backend.
If you want one Reaper instance to control multiple DCs with closed JMX
ports, you'll need to set datacenterAvailability to LOCAL, but that will
disable some safety checks and is not recommended.
You can start multiple Reaper instances in the same DC if you want to
achieve HA.
I recommend checking this page to get all the information about multi DC
setups with Reaper : http://cassandra-reaper.io/docs/usage/multi_dc/
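
For the one-Reaper-per-DC setup, the relevant lines in each instance's
cassandra-reaper.yaml look like this (the Cassandra backend connection details
live in the same file; see the page above for a full example):

storageType: cassandra
datacenterAvailability: EACH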

Cheers,


On Sat, Sep 29, 2018 at 6:47 PM Abdul Patel  wrote:

> Is the multidc reaper for load balancing if one goes dpwn another node can
> take care of shchedule repairs or we can actuly schedule repairs at dc
> level woth seperate reaper instances.
> I am planning to have 3 reaper instances in 3 dc .
>
>
> On Friday, September 28, 2018, Abdul Patel  wrote:
>
>> Hi
>>
>> I have 18 node 3 dc cluster, trying to use reaper multi dc concept using
>> datacenteravailabiloty =EACH
>> But is there differnt steps as i start the first instance and add cluster
>> it repairs for full clustrr rather than dc.
>> Am i missing any steps?
>> Also the contact points on this scebario should be only relevabt to that
>> dc?
>>
> --
> You received this message because you are subscribed to the Google Groups
> "TLP Apache Cassandra Reaper users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tlp-apache-cassandra-reaper-users+unsubscr...@googlegroups.com.
> To post to this group, send email to
> tlp-apache-cassandra-reaper-us...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tlp-apache-cassandra-reaper-users/CAHEGkNMRpWnU7MvUsiN3xos1D6CqJXvUWxhX%3DS4ahZFQpfNGLQ%40mail.gmail.com
> <https://groups.google.com/d/msgid/tlp-apache-cassandra-reaper-users/CAHEGkNMRpWnU7MvUsiN3xos1D6CqJXvUWxhX%3DS4ahZFQpfNGLQ%40mail.gmail.com?utm_medium=email_source=footer>
> .
> For more options, visit https://groups.google.com/d/optout.
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Large partitions

2018-09-13 Thread Alexander Dejanovski
Hi Gedeon,

you should check Robert Stupp's 2016 talk about large partitions :
https://www.youtube.com/watch?v=N3mGxgnUiRY

Cheers,


On Thu, Sep 13, 2018 at 6:42 PM Gedeon Kamga  wrote:

> Folks,
>
> Based on the information found here
> https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html
>  ,
> the recommended limit for a partition size is 100MB. Even though, DataStax
> clearly states that this is a rule of thumb, some team members are claiming
> that our Cassandra *Write *is very slow because the partitions on some
> tables are over 100MB. I know for a fact that this rule has changed since
> 2.2. Starting Cassandra 2.2 and up, the new rule of thumb for partition
> size is *a few hundreds MB*, given the improvement on the architecture.
> Now, I am unable to find the reference (maybe I got it at a Cassandra
> training by DataStax). I would like to share it with my team. Did anyone
> come across this information? If yes, can you please share it?
>
> Thanks!
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: High IO and poor read performance on 3.11.2 cassandra cluster

2018-09-05 Thread Alexander Dejanovski
>> WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-09-04 19:39:51,057
>> NoSpamLogger.java:94 - Out of 29 commit log syncs over the past 263.94s
>> with average duration of 15.17ms, 2 have exceeded the configured commit
>> interval by an average of 20.32ms
>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:39:53,015
>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
>> a fresh one
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
>> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='system',
>> ColumnFamily='batches') to free up room. Used total: 0.33/0.00, live:
>> 0.33/0.00, flushing: 0.00/0.00, this: 0.13/0.00
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:39:59,659
>> ColumnFamilyStore.java:918 - Enqueuing flush of batches: 128.567MiB (13%)
>> on-heap, 0.000KiB (0%) off-heap
>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,685
>> Memtable.java:456 - Writing Memtable-batches@1450035834(78.630MiB
>> serialized bytes, 125418 ops, 13%/0% of on/off-heap limit), flushed range =
>> (null, null]
>> DEBUG [PerDiskMemtableFlushWriter_0:12] 2018-09-04 19:39:59,695
>> Memtable.java:485 - Completed flushing
>> /var/lib/cassandra/data/system/batches-919a4bc57a333573b03e13fc3f68b465/mc-16-big-Data.db
>> (0.000KiB) for commitlog position
>> CommitLogPosition(segmentId=1536065319618, position=7958044)
>> DEBUG [MemtableFlushWriter:12] 2018-09-04 19:39:59,695
>> ColumnFamilyStore.java:1216 - Flushed to [] (0 sstables, 0.000KiB), biggest
>> 0.000KiB, smallest 8589934592.000GiB
>> DEBUG [ScheduledTasks:1] 2018-09-04 19:40:15,710 MonitoringTask.java:173
>> - 2 operations were slow in the last 4999 msecs:
>> , time 575 msec -
>> slow timeout 500 msec/cross-node
>> , time 645 msec - slow
>> timeout 500 msec/cross-node
>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:20,475
>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
>> a fresh one
>> DEBUG [COMMIT-LOG-ALLOCATOR] 2018-09-04 19:40:46,675
>> AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating
>> a fresh one
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,976
>> ColumnFamilyStore.java:1308 - Flushing largest CFS(Keyspace='ks',
>> ColumnFamily='xyz') to free up room. Used total: 0.33/0.00, live:
>> 0.33/0.00, flushing: 0.00/0.00, this: 0.12/0.00
>> DEBUG [SlabPoolCleaner] 2018-09-04 19:41:04,977
>> ColumnFamilyStore.java:918 - Enqueuing flush of xyz: 121.374MiB (12%)
>> on-heap, 0.000KiB (0%) off-heap
>>
>> *Observation :*  frequent "operations were slow in the last " (for
>> select queries) and "WARN: commit log syncs over the past"
>> ===
>>
>> *nodetool tablestats -H ks.xyz*
>> Total number of tables: 89
>> 
>> Keyspace : ks
>> Read Count: 1439722
>> Read Latency: 1.8982509581710914 ms
>> Write Count: 4222811
>> Write Latency: 0.016324778684151386 ms
>> Pending Flushes: 0
>> Table: xyz
>> SSTable count: 1036
>> SSTables in each level: [1, 10, 116/100, 909, 0, 0, 0, 0,
>> 0]
>> Space used (live): 187.09 GiB
>> Space used (total): 187.09 GiB
>> Space used by snapshots (total): 0 bytes
>> Off heap memory used (total): 783.93 MiB
>> SSTable Compression Ratio: 0.3238726404414842
>> Number of partitions (estimate): 447095605
>> Memtable cell count: 306194
>> Memtable data size: 20.59 MiB
>> Memtable off heap memory used: 0 bytes
>> Memtable switch count: 7
>> Local read count: 1440322
>> Local read latency: 6.785 ms
>> Local write count: 1408204
>> Local write latency: 0.021 ms
>> Pending flushes: 0
>> Percent repaired: 0.0
>> Bloom filter false positives: 19
>> Bloom filter false ratio: 0.3
>> Bloom filter space used: 418.2 MiB
>> Bloom filter off heap memory used: 418.19 MiB
>> Index summary off heap memory used: 307.75 MiB
>> Compression metadata off heap memory used: 57.99 MiB
>> Compacted partition minimum bytes: 150
>> Compacted partition maximum bytes: 1916
>> Compacted partition mean bytes: 1003
>> Average live cells per slice (last five minutes): 20.0
>> Maximum live cells per slice (last five minutes): 20
>> Average tombstones per slice (last five minutes): 1.0
>> Maximum tombstones per slice (last five minutes): 1
>> Dropped Mutations: 0 bytes
>>
>> --
>>
>> regards,
>> Laxmikant Upadhyay
>>
>> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: URGENT: disable reads from node

2018-08-29 Thread Alexander Dejanovski
Kurt is right.

So here are the options I can think of :
- use the join_ring false technique and rely on hints. You'll need to
disable the native transport on the node as well to prevent direct
connections to be made to it. Hopefully, you can run repair in less than 3
hours which is the hint window (hints will be collected while the node
hasn't joined the ring). Otherwise you'll have more consistency issues
after the node joins the ring again. Maybe incremental repair could help
fixing this quickly afterwards if you've been running full repairs that
involved anticompaction (if you're running at least Cassandra 2.2).
- Fully re-bootstrap the node by replacing itself, using the
replace_address_first_boot technique (but since you have RF=2, that would
most probably mean some data loss since you read/write at ONE)
- Try to cheat the dynamic snitch to take the node out of reads. You would
then have the node join the ring normally, disable native transport and
raise Severity (in org.apache.cassandra.db:type=DynamicEndpointSnitch) to
something like 50 so the node won't be selected by the dynamic snitch. I
guess the value will reset itself over time so you may need to set it to 50
on a regular basis while repair is happening.
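
A rough sketch of the first option (keyspace name is a placeholder, and how
the JVM flag is passed depends on your setup):

# start Cassandra with -Dcassandra.join_ring=false, then:
nodetool disablebinary        # keep clients from connecting to this node directly
nodetool repair my_ks         # ideally completes within the 3 hour hint window
nodetool join                 # rejoin the ring and start serving reads again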

I would then strongly consider moving to RF=3 because RF=2 will lead you to
this type of situation again in the future and does not allow quorum reads
with fault tolerance.

Good luck,

On Wed, Aug 29, 2018 at 1:56 PM Vlad  wrote:

> I restarted with cassandra.join_ring=false
> nodetool status on other nodes shows this node as DN, while it see itself
> as UN.
>
>
> >I'd say best to just query at QUORUM until you can finish repairs.
> We have RH 2, so I guess QUORUM queries will fail. Also different
> application should be changed for this.
>
>
> On Wednesday, August 29, 2018 2:41 PM, kurt greaves 
> wrote:
>
>
> Note that you'll miss incoming writes if you do that, so you'll be
> inconsistent even after the repair. I'd say best to just query at QUORUM
> until you can finish repairs.
>
> On 29 August 2018 at 21:22, Alexander Dejanovski 
> wrote:
>
> Hi Vlad, you must restart the node but first disable joining the cluster,
> as described in the second part of this blog post :
> http://thelastpickle.com/blog/ 2018/08/02/Re-Bootstrapping-
> Without-Bootstrapping.html
> <http://thelastpickle.com/blog/2018/08/02/Re-Bootstrapping-Without-Bootstrapping.html>
>
> Once repaired, you'll have to run "nodetool join" to start serving reads.
>
>
> Le mer. 29 août 2018 à 12:40, Vlad  a écrit :
>
> Will it help to set read_repair_chance to 1 (compaction is
> SizeTieredCompactionStrategy)?
>
>
> On Wednesday, August 29, 2018 1:34 PM, Vlad 
> wrote:
>
>
> Hi,
>
> quite urgent questions:
> due to disk and C* start problem we were forced to delete commit logs from
> one of nodes.
>
> Now repair is running, but meanwhile some reads bring no data (RF=2)
>
> Can this node be excluded from reads queries? And that  all reads will be
> redirected to other node in the ring?
>
>
> Thanks to All for help.
>
>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: URGENT: disable reads from node

2018-08-29 Thread Alexander Dejanovski
Hi Vlad, you must restart the node but first disable joining the cluster,
as described in the second part of this blog post :
http://thelastpickle.com/blog/2018/08/02/Re-Bootstrapping-Without-Bootstrapping.html

Once repaired, you'll have to run "nodetool join" to start serving reads.


Le mer. 29 août 2018 à 12:40, Vlad  a écrit :

> Will it help to set read_repair_chance to 1 (compaction is
> SizeTieredCompactionStrategy)?
>
>
> On Wednesday, August 29, 2018 1:34 PM, Vlad 
> wrote:
>
>
> Hi,
>
> quite urgent questions:
> due to disk and C* start problem we were forced to delete commit logs from
> one of nodes.
>
> Now repair is running, but meanwhile some reads bring no data (RF=2)
>
> Can this node be excluded from reads queries? And that  all reads will be
> redirected to other node in the ring?
>
>
> Thanks to All for help.
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Incremental repair

2018-08-20 Thread Alexander Dejanovski
Hi Prachi,

Incremental repair has been the default since C* 2.2.

You can run a full repair by adding the "--full" flag to your nodetool
command.
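
For instance, keeping the -pr option you are already using (keyspace name is a
placeholder, drop it to repair all keyspaces):

nodetool repair --full -pr my_keyspace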

Cheers,


Le lun. 20 août 2018 à 19:50, Prachi Rath  a écrit :

> Hi Community,
>
> I am currently creating a new cluster with cassandra 3.11.2 ,while
> enabling repair noticed that incremental repair is true in logfile.
>
>
> (parallelism: parallel, primary range: true, incremental: true, job
> threads: 1, ColumnFamilies: [], dataCenters: [], hosts: [], # of ranges:
> 20, pull repair: false)
>
> i was running repair by -pr option only.
>
> Question: Is incremental repair the default repair for cassandra 3.11.2
> version.
>
> Thanks,
> Prachi
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Compaction throughput vs. number of compaction threads?

2018-06-05 Thread Alexander Dejanovski
Hi,

The compaction throughput is indeed shared by all compactors.
I would not advise to go below 8MB/s per compactor as slowing down
compactions put more pressure on the heap.

When tuning compaction, the first thing to do is evaluate the maximum
throughput your disks can sustain without impacting p99 read latencies.
Then you can consider raising the number of compactors if you're still
seeing contention.

So the advice would be : don't raise the number of compactors, 4 is
probably enough already and tune the compaction throughput if you're
running on SSDs or if you have an array of HDDs.
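
Both settings can be inspected and changed live while you watch the read
latencies (the value below is just an example, in MB/s; persist it in
cassandra.yaml through compaction_throughput_mb_per_sec once you're happy
with it):

nodetool getcompactionthroughput
nodetool setcompactionthroughput 64   # 0 removes the throttling entirely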

Cheers,

On Tue, Jun 5, 2018 at 10:48 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hello,
>
>
>
> most likely obvious and perhaps already answered in the past, but just
> want to be sure …
>
>
>
> E.g. I have set:
>
> concurrent_compactors: 4
>
> compaction_throughput_mb_per_sec: 16
>
>
>
> I guess this will lead to ~ 4MB/s per Thread if I have 4 compactions
> running in parallel?
>
>
>
> So, in case of upscaling a machine and following the recommendation in
> cassandra.yaml I may set:
>
>
>
> concurrent_compactors: 8
>
>
>
>
>
> If this throughput remains unchanged, does this mean that we have 2 MB/s
> per Thread then, e.g. largish compactions running on a single thread taking
> twice the time then?
>
>
>
> Using Cassandra 2.1 and 3.11 in case this matters.
>
>
>
>
>
> Thanks a lot!
>
> Thomas
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Question About Reaper

2018-05-21 Thread Alexander Dejanovski
You won't be able to have fewer segments than vnodes, so just use 256
segments per node, use parallel as repair parallelism, and set intensity to
1.

You apparently have more than 3TB per node, and that kind of density is
always challenging when it comes to run "fast" repairs.

Cheers,

Le mar. 22 mai 2018 à 07:28, Surbhi Gupta <surbhi.gupt...@gmail.com> a
écrit :

> We are on Dse 4.8.15 and it is cassandra 2.1.
> What is the best configuration to use for reaper for 144 nodes with 256
> vnodes and it shows around 532TB data when we start opscenter repairs.
>
> We need to finish repair soon.
>
> On Mon, May 21, 2018 at 10:53 AM Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi Surbhi,
>>
>> Reaper might indeed be your best chance to reduce the overhead of vnodes
>> there.
>> The latest betas include a new feature that will group vnodes sharing the
>> same replicas in the same segment. This will allow having fewer segments
>> than vnodes, and is available with Cassandra 2.2 and onwards (the
>> improvement is especially beneficial with Cassandra 3.0+ as such token
>> ranges will be repaired in a single session).
>>
>> We have a gitter that you can join if you want to ask questions.
>>
>> Cheers,
>>
>> Le lun. 21 mai 2018 à 15:29, Surbhi Gupta <surbhi.gupt...@gmail.com> a
>> écrit :
>>
>>> Thanks Abdul
>>>
>>> On Mon, May 21, 2018 at 6:28 AM Abdul Patel <abd786...@gmail.com> wrote:
>>>
>>>> We have a parameter in the reaper yaml file called
>>>> repairManagerSchedulingIntervalSeconds, default is 10 seconds. I tested
>>>> with 8, 6 and 5 seconds and found 5 seconds optimal for my environment. You
>>>> can go down further but it will have cascading effects on cpu and memory
>>>> consumption.
>>>> So test well.
>>>>
>>>>
>>>> On Monday, May 21, 2018, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
>>>>
>>>>> Thanks a lot for your inputs,
>>>>> Abdul, how did u tune reaper?
>>>>>
>>>>> On Sun, May 20, 2018 at 10:10 AM Jonathan Haddad <j...@jonhaddad.com>
>>>>> wrote:
>>>>>
>>>>>> FWIW the largest deployment I know about is a single reaper instance
>>>>>> managing 50 clusters and over 2000 nodes.
>>>>>>
>>>>>> There might be bigger, but I either don’t know about it or can’t
>>>>>> remember.
>>>>>>
>>>>>> On Sun, May 20, 2018 at 10:04 AM Abdul Patel <abd786...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I recently tested reaper and it actually helped us a lot. Even with
>>>>>>> our small footprint 18 node cluster, reaper takes close to 6 hrs (it was
>>>>>>> 13 hrs, I was able to tune it by 50%). But it really depends on the number
>>>>>>> of nodes. For example if you have 4 nodes then it runs on 4*256 = 1024
>>>>>>> segments, so for your env. it will be 256*144, close to 36k segments.
>>>>>>> Better test on a poc box how much time it takes and then proceed
>>>>>>> further. I have tested so far in 1 dc only; we can actually have a separate
>>>>>>> reaper instance handling each dc but haven't tested it yet.
>>>>>>>
>>>>>>>
>>>>>>> On Sunday, May 20, 2018, Surbhi Gupta <surbhi.gupt...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> We have a cluster with 144 nodes( 3 datacenter) with 256 Vnodes .
>>>>>>>> When we tried to start repairs from opscenter then it showed
>>>>>>>> 1.9Million ranges to repair .
>>>>>>>> And even after doing compaction and strekamthroughput to 0 ,
>>>>>>>> opscenter is not able to help us much to finish repair in 9 days 
>>>>>>>> timeframe .
>>>>>>>>
>>>>>>>> What is your thought on Reaper ?
>>>>>>>> Do you think , Reaper might be able to help us in this scenario ?
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>> Surbhi
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>> Jon Haddad
>>>>>> http://www.rustyrazorblade.com
>>>>>> twitter: rustyrazorblade
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>
>>> --
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Question About Reaper

2018-05-21 Thread Alexander Dejanovski
Hi Surbhi,

Reaper might indeed be your best chance to reduce the overhead of vnodes
there.
The latest betas include a new feature that will group vnodes sharing the
same replicas in the same segment. This will allow having fewer segments
than vnodes, and is available with Cassandra 2.2 and onwards (the
improvement is especially beneficial with Cassandra 3.0+ as such token
ranges will be repaired in a single session).

We have a gitter that you can join if you want to ask questions.

Cheers,

Le lun. 21 mai 2018 à 15:29, Surbhi Gupta <surbhi.gupt...@gmail.com> a
écrit :

> Thanks Abdul
>
> On Mon, May 21, 2018 at 6:28 AM Abdul Patel <abd786...@gmail.com> wrote:
>
>> We have a parameter in the reaper yaml file called
>> repairManagerSchedulingIntervalSeconds; the default is 10 seconds. I tested
>> with 8, 6 and 5 seconds and found 5 seconds optimal for my environment. You
>> can go down further but it will have cascading effects on cpu and memory
>> consumption.
>> So test well.
>>
>>
>> On Monday, May 21, 2018, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
>>
>>> Thanks a lot for your inputs,
>>> Abdul, how did u tune reaper?
>>>
>>> On Sun, May 20, 2018 at 10:10 AM Jonathan Haddad <j...@jonhaddad.com>
>>> wrote:
>>>
>>>> FWIW the largest deployment I know about is a single reaper instance
>>>> managing 50 clusters and over 2000 nodes.
>>>>
>>>> There might be bigger, but I either don’t know about it or can’t
>>>> remember.
>>>>
>>>> On Sun, May 20, 2018 at 10:04 AM Abdul Patel <abd786...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I recently tested reaper and it actually helped us a lot. Even with our
>>>>> small footprint of 18 nodes reaper takes close to 6 hrs (it was 13 hrs
>>>>> before, i was able to tune it by ~50%). But it really depends on the number
>>>>> of nodes. For example if you have 4 nodes then it runs 4*256 = 1024 segments,
>>>>> so for your env. it will be 256*144, close to 36k segments.
>>>>> Better test on a poc box how much time it takes and then proceed further.
>>>>> I have tested so far in 1 dc only; we can actually have a separate reaper
>>>>> instance handling a separate dc but haven't tested it yet.
>>>>>
>>>>>
>>>>> On Sunday, May 20, 2018, Surbhi Gupta <surbhi.gupt...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> We have a cluster with 144 nodes( 3 datacenter) with 256 Vnodes .
>>>>>> When we tried to start repairs from opscenter then it showed
>>>>>> 1.9Million ranges to repair .
>>>>>> And even after setting compaction and stream throughput to 0,
>>>>>> opscenter is not able to help us much to finish repair in a 9 day
>>>>>> timeframe.
>>>>>>
>>>>>> What is your thought on Reaper ?
>>>>>> Do you think , Reaper might be able to help us in this scenario ?
>>>>>>
>>>>>> Thanks
>>>>>> Surbhi
>>>>>>
>>>>>>
>>>>>> --
>>>> Jon Haddad
>>>> http://www.rustyrazorblade.com
>>>> twitter: rustyrazorblade
>>>>
>>>>
>>>>
>>>
>>> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Nodetool repair multiple dc

2018-04-13 Thread Alexander Dejanovski
Hi Abdul,

Reaper has been used in production for several years now, by many companies.
I've seen it handling 100s of clusters and 1000s of nodes with a single
Reaper process.
Check the docs on cassandra-reaper.io to see which architecture matches
your cluster : http://cassandra-reaper.io/docs/usage/multi_dc/

Cheers,

On Fri, Apr 13, 2018 at 4:38 PM Rahul Singh <rahul.xavier.si...@gmail.com>
wrote:

> Makes sense it takes a long time since it has to reconcile against
> replicas in all DCs. I leverage commercial tools for production clusters,
> but I’m pretty sure Reaper is the best open source option. Otherwise you’ll
> waste a lot of time trying to figure it out on your own. No need to
> reinvent the wheel.
>
> On Apr 12, 2018, 11:02 PM -0400, Abdul Patel <abd786...@gmail.com>, wrote:
>
> Hi All,
>
> I have an 18 node cluster across 3 dcs. If i try to run incremental repair on a
> single node it takes forever, sometimes 45 min to 1 hr, and sometimes it times
> out, so i started running "nodetool repair -dc dc1" for each dc one by one,
> which works fine. Do we have a better way to handle this?
> I am thinking about exploring cassandra reaper. Has anyone used that
> in prod?
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Many SSTables only on one node

2018-04-05 Thread Alexander Dejanovski
40 pending compactions is pretty high and you should have way less than
that most of the time, otherwise it means that compaction is not keeping up
with your write rate.

If you indeed have SSDs for data storage, increase your compaction
throughput to 100 or 200 (depending on how the CPUs handle the load). You
can experiment with compaction throughput using : nodetool
setcompactionthroughput 100

You can raise the number of concurrent compactors as well and set it to a
value between 4 and 6 if you have at least 8 cores and CPUs aren't
overwhelmed.
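For example, something like this (a sketch only, adjust the numbers to what your
hardware can take):

    nodetool setcompactionthroughput 100

    # cassandra.yaml (requires a restart to take effect)
    concurrent_compactors: 4

Keep in mind that nodetool setcompactionthroughput only lasts until the next
restart, so once you've settled on a value, persist it in cassandra.yaml
(compaction_throughput_mb_per_sec).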

I'm not sure why you ended up with only one node having 6k SSTables and not
the others, but you should apply the above changes so that you can lower
the number of pending compactions and see if it prevents the issue from
happening again.

Cheers,


On Thu, Apr 5, 2018 at 11:33 AM Dmitry Simonov <dimmobor...@gmail.com>
wrote:

> Hi, Alexander!
>
> SizeTieredCompactionStrategy is used for all CFs in problematic keyspace.
> Current compaction throughput is 16 MB/s (default value).
>
> We always have about 40 pending and 2 active "CompactionExecutor" tasks in
> "tpstats".
> Mostly because of another (bigger) keyspace in this cluster.
> But the situation is the same on each node.
>
> According to "nodetool compactionhistory", compactions on this CF run
> (sometimes several times per day, sometimes one time per day, the last run
> was yesterday).
> We run "repair -full" regulary for this keyspace (every 24 hours on each
> node), because gc_grace_seconds is set to 24 hours.
>
> Should we consider increasing compaction throughput and
> "concurrent_compactors" (as recommended for SSDs) to keep
> "CompactionExecutor" pending tasks low?
>
> 2018-04-05 14:09 GMT+05:00 Alexander Dejanovski <a...@thelastpickle.com>:
>
>> Hi Dmitry,
>>
>> could you tell us which compaction strategy that table is currently using
>> ?
>> Also, what is the compaction max throughput and is auto-compaction
>> correctly enabled on that node ?
>>
>> Did you recently run repair ?
>>
>> Thanks,
>>
>> On Thu, Apr 5, 2018 at 10:53 AM Dmitry Simonov <dimmobor...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> Could you please give some ideas on the following problem?
>>>
>>> We have a cluster with 3 nodes, running Cassandra 2.2.11.
>>>
>>> We've recently discovered high CPU usage on one cluster node, after some
>>> investigation we found that number of sstables for one CF on it is very
>>> big: 5800 sstables, on other nodes: 3 sstables.
>>>
>>> Data size in this keyspace was not very big ~100-200Mb per node.
>>>
>>> There is no such problem with other CFs of that keyspace.
>>>
>>> nodetool compact solved the issue as a quick-fix.
>>>
>>> But I'm wondering, what was the cause? How can I prevent it from repeating?
>>>
>>> --
>>> Best Regards,
>>> Dmitry Simonov
>>>
>> --
>> -
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>
>
>
> --
> Best Regards,
> Dmitry Simonov
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Many SSTables only on one node

2018-04-05 Thread Alexander Dejanovski
Hi Dmitry,

could you tell us which compaction strategy that table is currently using ?
Also, what is the compaction max throughput and is auto-compaction
correctly enabled on that node ?

Did you recently run repair ?

Thanks,

On Thu, Apr 5, 2018 at 10:53 AM Dmitry Simonov <dimmobor...@gmail.com>
wrote:

> Hello!
>
> Could you please give some ideas on the following problem?
>
> We have a cluster with 3 nodes, running Cassandra 2.2.11.
>
> We've recently discovered high CPU usage on one cluster node, after some
> investigation we found that number of sstables for one CF on it is very
> big: 5800 sstables, on other nodes: 3 sstables.
>
> Data size in this keyspace was not very big ~100-200Mb per node.
>
> There is no such problem with other CFs of that keyspace.
>
> nodetool compact solved the issue as a quick-fix.
>
> But I'm wondering, what was the cause? How can I prevent it from repeating?
>
> --
> Best Regards,
> Dmitry Simonov
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Cassandra upgrade from 2.2.8 to 3.10

2018-03-28 Thread Alexander Dejanovski
You can perform an upgrade from 2.2.x straight to 3.11.2, but the op
suggests adding nodes in 3.10 to a cluster that runs 2.2.8, which is why
Jeff says it won't work.

I see no reason to upgrade to 3.10 and not 3.11.2 by the way.

On Wed, Mar 28, 2018 at 5:10 PM Fred Habash <fmhab...@gmail.com> wrote:

> Hi ...
> I'm finding anecdotal evidence on the internet that we are able to upgrade
> 2.2.8 to latest 3.11.2. Post below indicates that you can upgrade to latest
> 3.x from 2.1.9 because 3.x no longer requires 'structured upgrade path'.
>
> I just want to confirm that such upgrade is supported. If yes, where can I
> find official documentation showing upgrade path across releases.
>
>
> https://stackoverflow.com/questions/42094935/apache-cassandra-upgrade-3-x-from-2-1
>
> Thanks
>
> On Mon, Aug 7, 2017 at 5:58 PM, ZAIDI, ASAD A <az1...@att.com> wrote:
>
>> Hi folks, I’ve question on upgrade method I’m thinking to execute.
>>
>>
>>
>> I’m  planning from apache-Cassandra 2.2.8 to release 3.10.
>>
>>
>>
>> My Cassandra cluster is configured like one rack with two Datacenters
>> like:
>>
>>
>>
>> 1.   DC1 has 4 nodes
>>
>> 2.   DC2 has 16 nodes
>>
>>
>>
>> We’re adding another 12 nodes and would eventually need to remove those 4
>> nodes in DC1.
>>
>>
>>
>> I’m thinking to add another third data center with like DC3 with 12 nodes
>> having apache Cassandra 3.10 installed. Then, I start upgrading seed nodes
>> first in DC1 & DC2 – once all 20nodes in ( DC1 plus DC2) upgraded – I can
>> safely remove 4 DC1 nodes,
>>
>> Can you guys please let me know if this approach would work? I’m
>> concerned that having mixed versions on Cassandra nodes may cause issues,
>> e.g. in streaming data/sstables from the existing DCs to the newly created
>> third DC with version 3.10 installed: will nodes in DC3 join the cluster
>> with data without issues?
>>
>>
>>
>> Thanks/Asad
>>
>>
>>
>>
>>
>>
>>
>>
>>
>
>
>
> --
>
>
> Thank you ...
> 
> Fred Habash, Database Solutions Architect (Oracle OCP 8i,9i,10g,11g)
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Node won't start

2018-02-03 Thread Alexander Dejanovski
Hi Brian,

I just tested this on a CCM cluster and the node started without problem.
It flushed some new SSTables a short while after.

I honestly do not know the specifics of how size_estimates is used, but if
it prevented a node from restarting I'd definitely remove the sstables to
get it back up.
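If you do go that route, the sequence would be something along these lines (the
path assumes the default data directory layout, adjust it to your installation,
and keep a copy of the files somewhere just in case):

    sudo service cassandra stop
    mkdir -p /tmp/size_estimates_backup
    mv /var/lib/cassandra/data/system/size_estimates-*/* /tmp/size_estimates_backup/
    sudo service cassandra start

The table should simply get repopulated the next time size estimates are
recomputed.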

Cheers,

On Sat, Feb 3, 2018 at 1:53 PM Brian Spindler <brian.spind...@gmail.com>
wrote:

> Hi guys, I've got a 2.1.15 node that will not start it seems.  Hangs on
> Opening system.size_estimates.  Sometimes it can take a while but I've let
> it run for 90m and nothing.  Should I move this sstable out of the way to
> let it start?  will it rebuild/refresh size estimates if I remove that
> folder?
>
> thanks
> -B
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: need to reclaim space with TWCS

2018-01-20 Thread Alexander Dejanovski
I would turn background read repair off on the table to improve the overlap
issue, but you'll still have foreground read repair if you use quorum reads
anyway.

So set dclocal_read_repair_chance to 0.0.
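Something along these lines (keyspace/table names are placeholders):

    ALTER TABLE my_ks.my_table WITH dclocal_read_repair_chance = 0.0;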

The commit you're referring to has been merged in 3.11.1, as 2.1 isn't
patched anymore.

On Sat, Jan 20, 2018 at 4:55 PM, Brian Spindler <brian.spind...@gmail.com>
wrote:

> Hi Alexander, after re-reading this
> https://issues.apache.org/jira/browse/CASSANDRA-13418 it seems you would
> recommend leaving dclocal_read_repair at maybe 10%  is that true?
>
> Also, has this been patched to 2.1?
> https://github.com/thelastpickle/cassandra/commit/58440e707cd6490847a37dc8d76c150d3eb27aab#diff-e8e282423dcbf34d30a3578c8dec15cdR176
>
>
> Cheers,
>
> -B
>
>
> On Sat, Jan 20, 2018 at 10:49 AM Brian Spindler <brian.spind...@gmail.com>
> wrote:
>
>> Hi Alexander,  Thanks for your response!  I'll give it a shot.
>>
>> On Sat, Jan 20, 2018 at 10:22 AM Alexander Dejanovski <
>> a...@thelastpickle.com> wrote:
>>
>>> Hi Brian,
>>>
>>> You should definitely set unchecked_tombstone_compaction to true and set
>>> the interval to the default of 1 day. Use a tombstone_threshold of 0.6 for
>>> example and see how that works.
>>> Tombstones will get purged depending on your partitioning as their
>>> partition needs to be fully contained within a single sstable.
>>>
>>> Deleting the sstables by hand is theoretically possible but should be
>>> kept as a last resort option if you're running out of space.
>>>
>>> Cheers,
>>>
>>> On Sat, Jan 20, 2018 at 3:41 PM, Brian Spindler <brian.spind...@gmail.com>
>>> wrote:
>>>
>>>> I probably should have mentioned our setup: we’re on Cassandra version
>>>> 2.1.15.
>>>>
>>>>
>>>> On Sat, Jan 20, 2018 at 9:33 AM Brian Spindler <
>>>> brian.spind...@gmail.com> wrote:
>>>>
>>>>> Hi, I have several column families using TWCS and it’s great.
>>>>> Unfortunately we seem to have missed the great advice in Alex’s article
>>>>> here: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html about
>>>>> setting the appropriate aggressive tombstone settings and now we have lots
>>>>> of timestamp overlaps and disk space to reclaim.
>>>>>
>>>>>
>>>>>
>>>>> I am trying to figure the best way out of this. Lots of the SSTables
>>>>> with overlapping timestamps in newer SSTables have droppable tombstones at
>>>>> like 0.895143957 or something similar, very close to 0.90 where the full
>>>>> sstable will drop afaik.
>>>>>
>>>>>
>>>>>
>>>>> I’m thinking to do the following immediately:
>>>>>
>>>>>
>>>>>
>>>>> Set *unchecked_tombstone_compaction = true*
>>>>>
>>>>> Set* tombstone_compaction_interval == TTL + gc_grace_seconds*
>>>>>
>>>>> Set* dclocal_read_repair_chance = 0.0 (currently 0.1)*
>>>>>
>>>>>
>>>>>
>>>>> If I do this, can I expect TWCS/C* to reclaim the space from those
>>>>> SSTables with 0.89* droppable tombstones?   Or do I (can I?) manually
>>>>> delete these files and will c* just ignore the overlapping data and treat
>>>>> as tombstoned?
>>>>>
>>>>>
>>>>>
>>>>> What else should/could be done?
>>>>>
>>>>>
>>>>>
>>>>> Thank you in advance for your advice,
>>>>>
>>>>>
>>>>>
>>>>> *__*
>>>>>
>>>>> *Brian Spindler *
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>> --
>>> -
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: need to reclaim space with TWCS

2018-01-20 Thread Alexander Dejanovski
Hi Brian,

You should definitely set unchecked_tombstone_compaction to true and set
the interval to the default of 1 day. Use a tombstone_threshold of 0.6 for
example and see how that works.
Tombstones will get purged depending on your partitioning as their
partition needs to be fully contained within a single sstable.
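As a rough sketch (keyspace/table names are placeholders, and since ALTER TABLE
replaces the whole compaction map you'll need to restate your existing class and
window options, which I'm only guessing at here):

    ALTER TABLE my_ks.my_table WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '1',
        'unchecked_tombstone_compaction': 'true',
        'tombstone_compaction_interval': '86400',
        'tombstone_threshold': '0.6'
    };

tombstone_compaction_interval is expressed in seconds, so 86400 is the 1 day
default mentioned above.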

Deleting the sstables by hand is theoretically possible but should be kept
as a last resort option if you're running out of space.

Cheers,

On Sat, Jan 20, 2018 at 3:41 PM, Brian Spindler <brian.spind...@gmail.com>
wrote:

> I probably should have mentioned our setup: we’re on Cassandra version
> 2.1.15.
>
>
> On Sat, Jan 20, 2018 at 9:33 AM Brian Spindler <brian.spind...@gmail.com>
> wrote:
>
>> Hi, I have several column families using TWCS and it’s great.
>> Unfortunately we seem to have missed the great advice in Alex’s article
>> here: http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html about
>> setting the appropriate aggressive tombstone settings and now we have lots
>> of timestamp overlaps and disk space to reclaim.
>>
>>
>>
>> I am trying to figure the best way out of this. Lots of the SSTables with
>> overlapping timestamps in newer SSTables have droppable tombstones at like
>> 0.895143957 or something similar, very close to 0.90 where the full sstable
>> will drop afaik.
>>
>>
>>
>> I’m thinking to do the following immediately:
>>
>>
>>
>> Set *unchecked_tombstone_compaction = true*
>>
>> Set* tombstone_compaction_interval == TTL + gc_grace_seconds*
>>
>> Set* dclocal_read_repair_chance = 0.0 (currently 0.1)*
>>
>>
>>
>> If I do this, can I expect TWCS/C* to reclaim the space from those
>> SSTables with 0.89* droppable tombstones?   Or do I (can I?) manually
>> delete these files and will c* just ignore the overlapping data and treat
>> as tombstoned?
>>
>>
>>
>> What else should/could be done?
>>
>>
>>
>> Thank you in advance for your advice,
>>
>>
>>
>> *__*
>>
>> *Brian Spindler *
>>
>>
>>
>>
>>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Alter composite column

2018-01-18 Thread Alexander Dejanovski
Compact storage only allows one column outside of the primary key so you'll
definitely need to recreate your table if you want to add columns.

On Thu, Jan 18, 2018 at 12:18 PM, Nicolas Guyomar <nicolas.guyo...@gmail.com>
wrote:

> Well it should be as easy as following this :
> https://docs.datastax.com/en/cql/3.1/cql/cql_using/use_alter_add.html
>
> But I'm worried that your initial requirement was to change the clustering
> key, as Alexander stated, you need to create a new table and transfer your
> data in it
>
> On 18 January 2018 at 12:03, Joel Samuelsson <samuelsson.j...@gmail.com>
> wrote:
>
>> It was indeed created with C* 1.X
>> Do you have any links or otherwise on how I would add the column4? I
>> don't want to risk destroying my data.
>>
>> Best regards,
>> Joel
>>
>> 2018-01-18 11:18 GMT+01:00 Nicolas Guyomar <nicolas.guyo...@gmail.com>:
>>
>>> Hi Joel,
>>>
>>> You cannot alter a table primary key.
>>>
>>> You can however alter your existing table to only add column4 using
>>> cqlsh and cql, even if this table as created back with C* 1.X for instance
>>>
>>> On 18 January 2018 at 11:14, Joel Samuelsson <samuelsson.j...@gmail.com>
>>> wrote:
>>>
>>>> So to rephrase that in CQL terms I have a table like this:
>>>>
>>>> CREATE TABLE events (
>>>> key text,
>>>> column1 int,
>>>> column2 int,
>>>> column3 text,
>>>> value text,
>>>> PRIMARY KEY(key, column1, column2, column3)
>>>> ) WITH COMPACT STORAGE
>>>>
>>>> and I'd like to change it to:
>>>> CREATE TABLE events (
>>>> key text,
>>>> column1 int,
>>>> column2 int,
>>>> column3 text,
>>>> column4 text,
>>>> value text,
>>>> PRIMARY KEY(key, column1, column2, column3, column4)
>>>> ) WITH COMPACT STORAGE
>>>>
>>>> Is this possible?
>>>> Best regards,
>>>> Joel
>>>>
>>>> 2018-01-12 16:53 GMT+01:00 Joel Samuelsson <samuelsson.j...@gmail.com>:
>>>>
>>>>> Hi,
>>>>>
>>>>> I have an older system (C* 2.1) using Thrift tables on which I want to
>>>>> alter a column composite. Right now it looks like (int, int, string) but I
>>>>> want it to be (int, int, string, string). Is it possible to do this on a
>>>>> live cluster without deleting the old data? Can you point me to some
>>>>> documentation about this? I can't seem to find it any more.
>>>>>
>>>>> Best regards,
>>>>> Joel
>>>>>
>>>>
>>>>
>>>
>>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Alter composite column

2018-01-18 Thread Alexander Dejanovski
Hi Joel,

Sadly it's not possible to alter the primary key of a table in Cassandra.
That would require rewriting all data on disk to match the new
partitioning and/or clustering.

You need to create a new table and transfer all data from the old one
programmatically.
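A rough sketch of what that could look like (column4 needs some value for the
existing rows, an empty string or a sentinel value, which is purely an
illustration here):

    CREATE TABLE events_v2 (
        key text,
        column1 int,
        column2 int,
        column3 text,
        column4 text,
        value text,
        PRIMARY KEY (key, column1, column2, column3, column4)
    ) WITH COMPACT STORAGE;

Then read every row from events and re-insert it into events_v2 with column4 set
to that placeholder value (with a small driver-based script, Spark, or cqlsh
COPY TO / COPY FROM plus a transformation step), and finally switch the
application over to the new table.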

Cheers,

On Thu, Jan 18, 2018 at 11:14 AM, Joel Samuelsson <samuelsson.j...@gmail.com>
wrote:

> So to rephrase that in CQL terms I have a table like this:
>
> CREATE TABLE events (
> key text,
> column1 int,
> column2 int,
> column3 text,
> value text,
> PRIMARY KEY(key, column1, column2, column3)
> ) WITH COMPACT STORAGE
>
> and I'd like to change it to:
> CREATE TABLE events (
> key text,
> column1 int,
> column2 int,
> column3 text,
> column4 text,
> value text,
> PRIMARY KEY(key, column1, column2, column3, column4)
> ) WITH COMPACT STORAGE
>
> Is this possible?
> Best regards,
> Joel
>
> 2018-01-12 16:53 GMT+01:00 Joel Samuelsson <samuelsson.j...@gmail.com>:
>
>> Hi,
>>
>> I have an older system (C* 2.1) using Thrift tables on which I want to
>> alter a column composite. Right now it looks like (int, int, string) but I
>> want it to be (int, int, string, string). Is it possible to do this on a
>> live cluster without deleting the old data? Can you point me to some
>> documentation about this? I can't seem to find it any more.
>>
>> Best regards,
>> Joel
>>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: New token allocation and adding a new DC

2018-01-17 Thread Alexander Dejanovski
Well, that's a shame...

That part of the code has been changed in trunk and now it uses
BootStrapper.getBootstrapTokens() instead of getRandomToken() when auto
boostrap is disabled :
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/service/StorageService.java#L938

I was hoping this would already be the case in 3.0.x/3.11.x :(
Maybe that change should be backported to 3.11.x ?

It doesn't seem like a big change actually (I can be wrong though,
Cassandra is a complex beast...) and your use case doesn't seem to be that
exotic.
One would expect that a new DC can be created with balanced ownership,
which is obviously not the case.


On Wed, Jan 17, 2018 at 6:27 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Wed, Jan 17, 2018 at 4:21 AM, kurt greaves <k...@instaclustr.com>
> wrote:
>
>> I believe you are able to get away with just altering the keyspace to
>> include both DC's even before the DC exists, and then adding your nodes to
>> that new DC using the algorithm. Note you'll probably want to take the
>> opportunity to reduce the number of vnodes to something reasonable. Based
>> off memory from previous testing you can get a good token balance with 16
>> vnodes if you have at least 6 nodes per rack (with RF=3 and 3 racks).
>>
>
> Alexander, Kurt,
>
> Thank you for the suggestions.
>
> None of them did work in the end, unfortunately:
>
> 1. Using auto_bootstrap=false always results in random token allocation,
> ignoring the allocate_tokens_for_keyspace option.
>
> The token allocation option is only considered if shouldBootstrap()
> returns true:
>
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L790
> if (shouldBootstrap()) {
>
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L842
>   BootStrapper.getBootstrapTokens()  (the only place in code using the
> token allocation option)
>
> https://github.com/apache/cassandra/blob/cassandra-3.0.15/src/java/org/apache/cassandra/service/StorageService.java#L901
> else { ...
>
> 2. Using auto_bootstrap=true and allocate_tokens_for_keyspace=data_ks
> gives us balanced range ownership on the new empty DC.  The problem though,
> is that rebuilding of an already bootstrapped node doesn't work: the node
> believes that it already has all the data.
>
> We are going to proceed by manually assigning a small number of tokens to
> the nodes in new DC with auto_bootstrap=false and only use the automatic
> token allocation when we need to scale it out.  This seems to be the only
> supported way to use it anyway.
>
> Regards,
> --
> Alex
>
>

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: TWCS and autocompaction

2018-01-16 Thread Alexander Dejanovski
The ticket I was referring to is the following :
https://issues.apache.org/jira/browse/CASSANDRA-13418

It's been merged in 3.11.1, so just make sure you enable
unsafe_aggressive_sstable_expiration and you'll evict expired SSTables
regardless of overlaps (and IMHO it's totally safe to do this).
Do not ever run major compactions on TWCS tables unless you have a really,
really valid reason, and do not ever disable autocompaction on any table
for a long time.

Foreground read repair will still happen, regardless of your settings, when
reading at QUORUM or LOCAL_QUORUM; that's just part of the read path.
read_repair_chance and dclocal_read_repair_chance set to 0.0 will only disable
background read repair, which also happens at other consistency levels.

Currently, you have a default TTL of 1555200 seconds (18 days, or 432 hours)
and a 4 hour time window, which can create up to 432 / 4 = 108 live buckets.
The advice Jeff Jirsa gave back in the day is to try to keep the number of
live buckets between 50 and 60, which means you should double the size of
your time windows to 8 hours.
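As a sketch, that could look like the following (keyspace/table names are
placeholders; since ALTER TABLE replaces the whole compaction map, keep your
other compaction options as they are, and double-check the exact option names
and any related safety switches against the 3.11.1 documentation):

    ALTER TABLE my_ks.my_table WITH compaction = {
        'class': 'TimeWindowCompactionStrategy',
        'compaction_window_unit': 'HOURS',
        'compaction_window_size': '8',
        'unchecked_tombstone_compaction': 'true',
        'unsafe_aggressive_sstable_expiration': 'true'
    };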

If you end up with 100 SSTables, then TWCS is properly doing its work,
keeping in mind that the current time window can/will have more than one
SSTable. Major compaction within a bucket will happen once it gets out of
the current time window.

Cheers,


On Tue, Jan 16, 2018 at 7:16 PM Cogumelos Maravilha <
cogumelosmaravi...@sapo.pt> wrote:

> Hi,
>
> My read_repair_chance is 0 (AND read_repair_chance = 0.0)
>
> When I bootstrap a new node there are around 700 sstables, but after auto
> compaction the number drops to around 100.
>
> I'm using C* 3.11.1. To solve the problem I've already changed to
> 'unchecked_tombstone_compaction': 'true'. Now should I run nodetool compact?
>
> And for the future crontab nodetool disableautocompaction?
>
> Thanks
>
> On 16-01-2018 11:35, Alexander Dejanovski wrote:
>
> Hi,
>
> The overlaps you're seeing on time windows aren't due to automatic
> compactions, but to read repairs.
> You must be reading at quorum or local_quorum which can perform foreground
> read repair in case of digest mismatch.
>
> You can set unchecked_tombstone_compaction to true if you want to perform
> single sstable compaction to purge tombstones and a patch has recently been
> merged in to allow twcs to delete fully expired data even in case of
> overlap between time windows (I can't remember if it's been merged in
> 3.11.1).
> Just so you know, the timestamp considered for time windows is the max
> timestamp. You can have old data in recent time windows, but not the
> opposite.
>
> Cheers,
>
> On Tue, Jan 16, 2018 at 12:07 PM, Cogumelos Maravilha <
> cogumelosmaravi...@sapo.pt> wrote:
>
>> Hi list,
>>
>> My settings:
>>
>> AND compaction = {'class':
>> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
>> 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
>> 'enabled': 'true', 'max_threshold': '64', 'min_threshold': '2',
>> 'tombstone_compaction_interval': '15000', 'tombstone_threshold': '0.2',
>> 'unchecked_tombstone_compaction': 'false'}
>> AND compression = {'chunk_length_in_kb': '64', 'class':
>> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>> AND crc_check_chance = 0.0
>> AND dclocal_read_repair_chance = 0.0
>> AND default_time_to_live = 1555200
>> AND gc_grace_seconds = 10800
>> AND max_index_interval = 2048
>> AND memtable_flush_period_in_ms = 0
>> AND min_index_interval = 128
>> AND read_repair_chance = 0.0
>> AND speculative_retry = '99PERCENTILE';
>>
>> Running this script:
>>
>> for f in *Data.db; do
>>ls -lrt $f
>>output=$(sstablemetadata $f 2>/dev/null)
>>max=$(echo "$output" | grep Maximum\ timestamp | cut -d" " -f3 | cut
>> -c 1-10)
>>min=$(echo "$output" | grep Minimum\ timestamp | cut -d" " -f3 | cut
>> -c 1-10)
>>date -d @$max +'%d/%m/%Y %H:%M:%S'
>>date -d @$min +'%d/%m/%Y %H:%M:%S'
>> done
>>
>> on sstables I'm getting values like these:
>>
>> -rw-r--r-- 1 cassandra cassandra 12137573577 Jan 14 20:08
>> mc-22750-big-Data.db
>> 14/01/2018 19:57:41
>> 31/12/2017 19:06:48
>>
>> -rw-r--r-- 1 cassandra cassandra 4669422106 Jan 14 06:55
>> mc-22322-big-Data.db
>> 12/01/2018 07:59:57
>> 28/12/2017 19:08:42
>>
>> My goal is using TWCS so that sstables expire fast because lots of new data
>> is coming in. What is the best approach to achieve that? Should I
>> disable auto compaction?
>> Thanks in advance.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: New token allocation and adding a new DC

2018-01-16 Thread Alexander Dejanovski
Hi Oleksandr,

if bootstrap is disabled, it will only skip the streaming phase but will
still go through token allocation and thus should use the new algorithm.
The algorithm won't try to spread data based on size on disk but it will
try to spread token ownership as evenly as possible.

The problem you'll run into is that ownership for a specific keyspace will
be null as long as the replication strategy isn't updated to create
replicas on the new DC.
Thinking about it quickly, I would do the following:

   - Create enough nodes in the new DC to match the target replication
   factor
   - Alter the replication strategy to add the target number of replicas in
   the new DC (they will start getting writes, and hopefully you've already
   segregated reads)
   - Continue adding nodes in the new DC (with auto_bootstrap = false),
   specifying the right keyspace to optimize token allocations
   - Run rebuild on all nodes in the new DC

I honestly never used it but that's my understanding of how it should work.

Cheers,


On Tue, Jan 16, 2018 at 3:51 PM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> Hello,
>
> We want to add a new rack to an existing cluster (a new Availability Zone
> on AWS).
>
> Currently we have 12 nodes in 2 racks with ~4 TB data per node.  We also
> want to have bigger number of smaller nodes.  In order to minimize the
> streaming we want to add a new DC which will span 3 racks and then
> decommission the old DC.
>
> Following the documented procedure we are going to create all nodes in the
> new DC with auto_bootstrap=false and a distinct dc_suffix.  Then we are
> going to run `nodetool rebuild OLD_DC` on every node.
>
> Since we are observing some uneven load distribution in the old DC, we
> wanted to make use of new token allocation algorithm of Cassandra 3.0+ when
> building the new DC.
>
> To our understanding, this is currently not supported, because the new
> algorithm can only be used during proper node bootstrap?
>
> In theory it should still be possible to allocate tokens in the new DC by
> telling Cassandra which keyspace to optimize for and from which remote DC
> the data will be streamed ultimately, or am I missing something?
>
> Reading through the original implementation ticket I didn't find any
> reference to interaction with rebuild:
> https://issues.apache.org/jira/browse/CASSANDRA-7032
> Nor do I find any open tickets that would discuss the topic.
>
> Is it reasonable to open an issue for that or is there some obvious
> blocker?
>
> Thanks,
> --
> Alex
>
>

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Too many tombstones using TTL

2018-01-16 Thread Alexander Dejanovski
I would not plan on deleting data at the row level as you'll end up with a
lot of tombstones eventually (and you won't even notice them).
It's not healthy to allow that many tombstones to be read, and while your
latency may fit your SLA now, it may not in the future.
Tombstones are going to create a lot of heap pressure and eventually
trigger long GC pauses, which then tend to affect the whole cluster (a slow
node is worse than a down node).

You should definitely separate data that is TTLed and data that is not in
different tables so that you can adjust compaction strategies,
gc_grace_seconds and read patterns accordingly. I understand that it will
complexify your code, but it will prevent severe performance issues in
Cassandra.

Tombstones won't be a problem for repair, they will get repaired as classic
cells. They negatively affect the read path mostly, and use space on disk.

On Tue, Jan 16, 2018 at 2:12 PM Python_Max <python@gmail.com> wrote:

> Hello.
>
> I was planning to remove a row (not partition).
>
> Most of the tombstones are seen in the use case of geographic grid with
> X:Y as partition key and object id (timeuuid) as clustering key where
> objects could be temporary with TTL about 10 hours or fully persistent.
> When I select all objects in specific X:Y I can even hit 100k (default)
> limit for some X:Y. I have changed this limit to 500k since 99.9p read
> latency is < 75ms so I should not (?) care how many tombstones while read
> latency is fine.
>
> Splitting entities to temporary and permanent and using different
> compaction strategies is an option but it will lead to code duplication and
> 2x read queries.
>
> Is my assumption correct that tombstones are not such a big problem as long
> as read latency and disk usage are okay? Do tombstones affect repair time
> (using reaper)?
>
> Thanks.
>
>
> On Tue, Jan 16, 2018 at 11:32 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi,
>>
>> could you be more specific about the deletes you're planning to perform ?
>> This will end up moving your problem somewhere else as you'll be
>> generating new tombstones (and if you're planning on deleting rows, be
>> aware that row level tombstones aren't reported anywhere in the metrics,
>> logs and query traces).
>> Currently you can delete your data at the partition level, which will
>> create a single tombstone that will shadow all your expired (and non
>> expired) data and is very efficient. The read path is optimized for such
>> tombstones and the data won't be fully read from disk nor exchanged between
>> replicas. But that's of course if your use case allows to delete full
>> partitions.
>>
>> We usually model so that we can restrict our reads to live data.
>> If you're creating time series, your clustering key should include a
>> timestamp, which you can use to avoid reading expired data. If your TTL is
>> set to 60 days, you can read only data that is strictly younger than that.
>> Then you can partition by time ranges, and access exclusively partitions
>> that have no chance to be expired yet.
>> Those techniques usually work better with TWCS, but the former could make
>> you hit a lot of SSTables if your partitions can spread over all time
>> buckets, so only use TWCS if you can restrict individual reads to up to 4
>> time windows.
>>
>> Cheers,
>>
>>
>> On Tue, Jan 16, 2018 at 10:01 AM Python_Max <python@gmail.com> wrote:
>>
>>> Hi.
>>>
>>> Thank you very much for detailed explanation.
>>> Seems that there is nothing I can do about it except delete records by
>>> key instead of expiring.
>>>
>>>
>>> On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
>>> a...@thelastpickle.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> As DuyHai said, different TTLs could theoretically be set for different
>>>> cells of the same row. And one TTLed cell could be shadowing another cell
>>>> that has no TTL (say you forgot to set a TTL and set one afterwards by
>>>> performing an update), or vice versa.
>>>> One cell could also be missing from a node without Cassandra knowing.
>>>> So turning an incomplete row that only has expired cells into a tombstone
>>>> row could lead to wrong results being returned at read time : the tombstone
>>>> row could potentially shadow a valid live cell from another replica.
>>>>
>>>> Cassandra needs to retain each TTLed cell and send it to replicas
>>>> during reads to cover all possible cases.
>>>>
>>>>
>>>> 

Re: TWCS and autocompaction

2018-01-16 Thread Alexander Dejanovski
Hi,

The overlaps you're seeing on time windows aren't due to automatic
compactions, but to read repairs.
You must be reading at quorum or local_quorum which can perform foreground
read repair in case of digest mismatch.

You can set unchecked_tombstone_compaction to true if you want to perform
single sstable compaction to purge tombstones and a patch has recently been
merged in to allow twcs to delete fully expired data even in case of
overlap between time windows (I can't remember if it's been merged in
3.11.1).
Just so you know, the timestamp considered for time windows is the max
timestamp. You can have old data in recent time windows, but not the
opposite.

Cheers,

On Tue, Jan 16, 2018 at 12:07 PM, Cogumelos Maravilha <
cogumelosmaravi...@sapo.pt> wrote:

> Hi list,
>
> My settings:
>
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
> 'enabled': 'true', 'max_threshold': '64', 'min_threshold': '2',
> 'tombstone_compaction_interval': '15000', 'tombstone_threshold': '0.2',
> 'unchecked_tombstone_compaction': 'false'}
> AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 0.0
> AND dclocal_read_repair_chance = 0.0
> AND default_time_to_live = 1555200
> AND gc_grace_seconds = 10800
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
>
> Running this script:
>
> for f in *Data.db; do
>ls -lrt $f
>output=$(sstablemetadata $f 2>/dev/null)
>max=$(echo "$output" | grep Maximum\ timestamp | cut -d" " -f3 | cut
> -c 1-10)
>min=$(echo "$output" | grep Minimum\ timestamp | cut -d" " -f3 | cut
> -c 1-10)
>date -d @$max +'%d/%m/%Y %H:%M:%S'
>date -d @$min +'%d/%m/%Y %H:%M:%S'
> done
>
> on sstables I'm getting values like these:
>
> -rw-r--r-- 1 cassandra cassandra 12137573577 Jan 14 20:08
> mc-22750-big-Data.db
> 14/01/2018 19:57:41
> 31/12/2017 19:06:48
>
> -rw-r--r-- 1 cassandra cassandra 4669422106 Jan 14 06:55
> mc-22322-big-Data.db
> 12/01/2018 07:59:57
> 28/12/2017 19:08:42
>
> My goal is using TWCS so that sstables expire fast because lots of new data
> is coming in. What is the best approach to achieve that? Should I
> disable auto compaction?
> Thanks in advance.
>
>
> ---------
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: vnodes: high availability

2018-01-16 Thread Alexander Dejanovski
 it's always going to be the case that any 2 nodes
> going down result in a loss of QUORUM for some token range.
>
> On 15 January 2018 at 19:59, Kyrylo Lebediev <kyrylo_lebed...@epam.com>
> wrote:
>
> Thanks Alexander!
>
>
> I'm not a MS in math too) Unfortunately.
>
>
> Not sure, but it seems to me that probability of 2/49 in your explanation
> doesn't take into account that vnodes endpoints are almost evenly
> distributed across all nodes (at least it's what I can see from "nodetool
> ring" output).
>
>
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
> of course this vnodes illustration is a theoretical one, but there no 2
> nodes on that diagram that can be switched off without losing a key range
> (at CL=QUORUM).
>
>
> That's because vnodes_per_node=8 > Nnodes=6.
>
> As far as I understand, situation is getting worse with increase of
> vnodes_per_node/Nnode ratio.
>
> Please, correct me if I'm wrong.
>
>
> How would the situation differ from this example by DataStax, if we had a
> real-life 6-nodes cluster with 8 vnodes on each node?
>
>
> Regards,
>
> Kyrill
>
>
> --
> *From:* Alexander Dejanovski <a...@thelastpickle.com>
> *Sent:* Monday, January 15, 2018 8:14:21 PM
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
>
> I was corrected off list that the odds of losing data when 2 nodes are
> down aren't dependent on the number of vnodes, but only on the number of
> nodes.
> The more vnodes, the smaller the chunks of data you may lose, and vice
> versa.
>
> I officially suck at statistics, as expected :)
>
> On Mon, Jan 15, 2018 at 5:55 PM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> Hi Kyrylo,
>
> the situation is a bit more nuanced than shown by the Datastax diagram,
> which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode
> distribution is purely random, and the replica for a vnode will be placed
> on the node that owns the next vnode in token order (yeah, that's not easy
> to formulate), you end up with statistics only.
>
> I kinda suck at maths but I'm going to risk making a fool of myself :)
>
> The odds for one vnode to be replicated on another node are, in your case,
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single
> node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node
> B to have a vnode replicated on node A than the opposite), that doubles the
> odds of 2 nodes being both replica for at least one vnode : 20.8%.
>
> Having a smaller number of vnodes will decrease the odds, just as having
> more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
> that area...)
>
> How many queries that will affect is a different question as it depends on
> which partition currently exist and are queried in the unavailable token
> ranges.
>
> Then you have rack awareness that comes with NetworkTopologyStrategy :
> If the number of replicas (3 in your case) is proportional to the number
> of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in
> a single rack, you will still have two other replicas available to satisfy
> quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths
> (but the odds will get slightly different).
>
> That makes maintenance predictable because you can shut down as many nodes
> as you want in a single rack without losing QUORUM.
>
> Feel free to correct my numbers if I'm wrong.
>
> Cheers,
>
>
>
>
>
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <kyrylo_lebed...@epam.com>
> wrote:
>
> Thanks, Rahul.
>
> But in your example, at the same time loss of Node3 and Node6 leads to
> loss of ranges N, C, J at consistency level QUORUM.
>
>
> As far as I understand in case vnodes > N_nodes_in_cluster and
> endpoint_snitch=SimpleSnitch, since:
>
>
> 1) "secondary" replicas are placed on two nodes 'next' to the node
> responsible for a range (in case of RF=3)
>
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>
>
> we get all physical nodes (servers) having mutually adjacent token ranges.
> Is it correct?
>
> At least in case of my real-world ~50-nodes cluster with 

Re: Too many tombstones using TTL

2018-01-16 Thread Alexander Dejanovski
Hi,

could you be more specific about the deletes you're planning to perform ?
This will end up moving your problem somewhere else as you'll be generating
new tombstones (and if you're planning on deleting rows, be aware that row
level tombstones aren't reported anywhere in the metrics, logs and query
traces).
Currently you can delete your data at the partition level, which will
create a single tombstone that will shadow all your expired (and non
expired) data and is very efficient. The read path is optimized for such
tombstones and the data won't be fully read from disk nor exchanged between
replicas. But that's of course if your use case allows to delete full
partitions.
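For instance, with the example table from earlier in this thread, a partition
level delete is simply:

    DELETE FROM test_ttl.items WHERE a = 'AAA';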

We usually model so that we can restrict our reads to live data.
If you're creating time series, your clustering key should include a
timestamp, which you can use to avoid reading expired data. If your TTL is
set to 60 days, you can read only data that is strictly younger than that.
Then you can partition by time ranges, and access exclusively partitions
that have no chance to be expired yet.
Those techniques usually work better with TWCS, but the former could make
you hit a lot of SSTables if your partitions can spread over all time
buckets, so only use TWCS if you can restrict individual reads to up to 4
time windows.
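A minimal sketch of that kind of model (all names are placeholders, and the
day-based bucketing is only an example):

    CREATE TABLE my_ks.events_by_day (
        entity_id text,
        day date,              -- time bucket, part of the partition key
        ts timestamp,
        payload text,
        PRIMARY KEY ((entity_id, day), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC);

    -- with a 60 day TTL, only read data that cannot be expired yet
    SELECT * FROM my_ks.events_by_day
    WHERE entity_id = 'foo'
      AND day = '2018-01-16'
      AND ts > '2017-11-17 00:00:00+0000';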

Cheers,


On Tue, Jan 16, 2018 at 10:01 AM Python_Max <python@gmail.com> wrote:

> Hi.
>
> Thank you very much for detailed explanation.
> Seems that there is nothing I can do about it except delete records by key
> instead of expiring.
>
>
> On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
>> Hi,
>>
>> As DuyHai said, different TTLs could theoretically be set for different
>> cells of the same row. And one TTLed cell could be shadowing another cell
>> that has no TTL (say you forgot to set a TTL and set one afterwards by
>> performing an update), or vice versa.
>> One cell could also be missing from a node without Cassandra knowing. So
>> turning an incomplete row that only has expired cells into a tombstone row
>> could lead to wrong results being returned at read time : the tombstone row
>> could potentially shadow a valid live cell from another replica.
>>
>> Cassandra needs to retain each TTLed cell and send it to replicas during
>> reads to cover all possible cases.
>>
>>
>> On Fri, Jan 12, 2018 at 5:28 PM Python_Max <python@gmail.com> wrote:
>>
>>> Thank you for response.
>>>
>>> I know about the option of setting TTL per column or even per item in a
>>> collection. However in my example the entire row has expired, shouldn't
>>> Cassandra be able to detect this situation and spawn a single tombstone for
>>> the entire row instead of many?
>>> Is there any reason for not doing this except that no one needs it? Is this
>>> suitable for a feature request or improvement?
>>>
>>> Thanks.
>>>
>>> On Wed, Jan 10, 2018 at 4:52 PM, DuyHai Doan <doanduy...@gmail.com>
>>> wrote:
>>>
>>>> "The question is why Cassandra creates a tombstone for every column
>>>> instead of single tombstone per row?"
>>>>
>>>> --> Simply because technically it is possible to set different TTL
>>>> value on each column of a CQL row
>>>>
>>>> On Wed, Jan 10, 2018 at 2:59 PM, Python_Max <python@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello, C* users and experts.
>>>>>
>>>>> I have (one more) question about tombstones.
>>>>>
>>>>> Consider the following example:
>>>>> cqlsh> create keyspace test_ttl with replication = {'class':
>>>>> 'SimpleStrategy', 'replication_factor': '1'}; use test_ttl;
>>>>> cqlsh> create table items(a text, b text, c1 text, c2 text, c3 text,
>>>>> primary key (a, b));
>>>>> cqlsh> insert into items(a,b,c1,c2,c3) values('AAA', 'BBB', 'C111',
>>>>> 'C222', 'C333') using ttl 60;
>>>>> bash$ nodetool flush
>>>>> bash$ sleep 60
>>>>> bash$ nodetool compact test_ttl items
>>>>> bash$ sstabledump mc-2-big-Data.db
>>>>>
>>>>> [
>>>>>   {
>>>>> "partition" : {
>>>>>   "key" : [ "AAA" ],
>>>>>   "position" : 0
>>>>> },
>>>>> "rows" : [
>>>>>   {
>>>>> "type" : "row",
>>>>> "position"

Re: vnodes: high availability

2018-01-15 Thread Alexander Dejanovski
I was corrected off list that the odds of losing data when 2 nodes are down
aren't dependent on the number of vnodes, but only on the number of nodes.
The more vnodes, the smaller the chunks of data you may lose, and vice
versa.

I officially suck at statistics, as expected :)

On Mon, Jan 15, 2018 at 5:55 PM, Alexander Dejanovski <a...@thelastpickle.com>
wrote:

> Hi Kyrylo,
>
> the situation is a bit more nuanced than shown by the Datastax diagram,
> which is fairly theoretical.
> If you're using SimpleStrategy, there is no rack awareness. Since vnode
> distribution is purely random, and the replica for a vnode will be placed
> on the node that owns the next vnode in token order (yeah, that's not easy
> to formulate), you end up with statistics only.
>
> I kinda suck at maths but I'm going to risk making a fool of myself :)
>
> The odds for one vnode to be replicated on another node are, in your case,
> 2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
> Given you have 256 vnodes, the odds for at least one vnode of a single
> node to exist on another one is 256*(2/49) = 10.4%
> Since the relationship is bi-directional (there are the same odds for node
> B to have a vnode replicated on node A than the opposite), that doubles the
> odds of 2 nodes being both replica for at least one vnode : 20.8%.
>
> Having a smaller number of vnodes will decrease the odds, just as having
> more nodes in the cluster.
> (now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
> that area...)
>
> How many queries that will affect is a different question as it depends on
> which partition currently exist and are queried in the unavailable token
> ranges.
>
> Then you have rack awareness that comes with NetworkTopologyStrategy :
> If the number of replicas (3 in your case) is proportional to the number
> of racks, Cassandra will spread replicas in different ones.
> In that situation, you can theoretically lose as many nodes as you want in
> a single rack, you will still have two other replicas available to satisfy
> quorum in the remaining racks.
> If you start losing nodes in different racks, we're back to doing maths
> (but the odds will get slightly different).
>
> That makes maintenance predictable because you can shut down as many nodes
> as you want in a single rack without losing QUORUM.
>
> Feel free to correct my numbers if I'm wrong.
>
> Cheers,
>
>
>
>
>
> On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <kyrylo_lebed...@epam.com>
> wrote:
>
>> Thanks, Rahul.
>>
>> But in your example, at the same time loss of Node3 and Node6 leads to
>> loss of ranges N, C, J at consistency level QUORUM.
>>
>>
>> As far as I understand in case vnodes > N_nodes_in_cluster and
>> endpoint_snitch=SimpleSnitch, since:
>>
>>
>> 1) "secondary" replicas are placed on two nodes 'next' to the node
>> responsible for a range (in case of RF=3)
>>
>> 2) there are a lot of vnodes on each node
>> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>>
>>
>> we get all physical nodes (servers) having mutually adjacent token ranges.
>> Is it correct?
>>
>> At least in case of my real-world ~50-node cluster with vnodes=256, RF=3
>> for this command:
>>
>> nodetool ring | grep '^' | awk '{print $1}' | uniq | grep -B2
>> -A2 '' | grep -v '' | grep -v '^--' | sort |
>> uniq | wc -l
>>
>> returned a number which equals Nnodes - 1, which means that I can't switch
>> off 2 nodes at the same time w/o losing some keyrange for CL=QUORUM.
>>
>>
>> Thanks,
>>
>> Kyrill
>> --
>> *From:* Rahul Neelakantan <ra...@rahul.be>
>> *Sent:* Monday, January 15, 2018 5:20:20 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: vnodes: high availability
>>
>> Not necessarily. It depends on how the token ranges for the vNodes are
>> assigned to them. For example take a look at this diagram
>>
>> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>>
>> In the vNode part of the diagram, you will see that Loss of Node 3 and
>> Node 6, will still not have any effect on Token Range A. But yes if you
>> lose two nodes that both have Token Range A assigned to them (Say Node 1
>> and Node 2), you will have unavailability with your specified configuration.
>>
>> You can sort of circumvent this by using the DataStax Java Driver and
>> having the client recognize a degraded cluster and operate temporarily in
>> downgraded consistency mode

Re: vnodes: high availability

2018-01-15 Thread Alexander Dejanovski
Hi Kyrylo,

the situation is a bit more nuanced than shown by the Datastax diagram,
which is fairly theoretical.
If you're using SimpleStrategy, there is no rack awareness. Since vnode
distribution is purely random, and the replica for a vnode will be placed
on the node that owns the next vnode in token order (yeah, that's not easy
to formulate), you end up with statistics only.

I kinda suck at maths but I'm going to risk making a fool of myself :)

The odds for one vnode to be replicated on another node are, in your case,
2/49 (out of 49 remaining nodes, 2 replicas need to be placed).
Given you have 256 vnodes, the odds for at least one vnode of a single node
to exist on another one is 256*(2/49) = 10.4%
Since the relationship is bi-directional (there are the same odds for node
B to have a vnode replicated on node A than the opposite), that doubles the
odds of 2 nodes being both replica for at least one vnode : 20.8%.

Having a smaller number of vnodes will decrease the odds, just as having
more nodes in the cluster.
(now once again, I hope my maths aren't fully wrong, I'm pretty rusty in
that area...)

How many queries that will affect is a different question as it depends on
which partition currently exist and are queried in the unavailable token
ranges.

Then you have rack awareness that comes with NetworkTopologyStrategy :
If the number of replicas (3 in your case) is proportional to the number of
racks, Cassandra will spread replicas in different ones.
In that situation, you can theoretically lose as many nodes as you want in
a single rack, you will still have two other replicas available to satisfy
quorum in the remaining racks.
If you start losing nodes in different racks, we're back to doing maths
(but the odds will get slightly different).

That makes maintenance predictable because you can shut down as many nodes
as you want in a single rack without losing QUORUM.

Feel free to correct my numbers if I'm wrong.

Cheers,





On Mon, Jan 15, 2018 at 5:27 PM Kyrylo Lebediev <kyrylo_lebed...@epam.com>
wrote:

> Thanks, Rahul.
>
> But in your example, at the same time loss of Node3 and Node6 leads to
> loss of ranges N, C, J at consistency level QUORUM.
>
>
> As far as I understand in case vnodes > N_nodes_in_cluster and
> endpoint_snitch=SimpleSnitch, since:
>
>
> 1) "secondary" replicas are placed on two nodes 'next' to the node
> responsible for a range (in case of RF=3)
>
> 2) there are a lot of vnodes on each node
> 3) ranges are evenly distributed between vnodes in case of SimpleSnitch,
>
>
> we get all physical nodes (servers) having mutually adjacent token ranges.
> Is it correct?
>
> At least in case of my real-world ~50-node cluster with vnodes=256, RF=3
> for this command:
>
> nodetool ring | grep '^' | awk '{print $1}' | uniq | grep -B2
> -A2 '' | grep -v '' | grep -v '^--' | sort |
> uniq | wc -l
>
> returned a number which equals Nnodes - 1, which means that I can't switch
> off 2 nodes at the same time w/o losing some keyrange for CL=QUORUM.
>
>
> Thanks,
>
> Kyrill
> --
> *From:* Rahul Neelakantan <ra...@rahul.be>
> *Sent:* Monday, January 15, 2018 5:20:20 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: vnodes: high availability
>
> Not necessarily. It depends on how the token ranges for the vNodes are
> assigned to them. For example take a look at this diagram
>
> http://docs.datastax.com/en/archived/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
>
> In the vNode part of the diagram, you will see that Loss of Node 3 and
> Node 6, will still not have any effect on Token Range A. But yes if you
> lose two nodes that both have Token Range A assigned to them (Say Node 1
> and Node 2), you will have unavailability with your specified configuration.
>
> You can sort of circumvent this by using the DataStax Java Driver and
> having the client recognize a degraded cluster and operate temporarily in
> downgraded consistency mode
>
>
> http://docs.datastax.com/en/latest-java-driver-api/com/datastax/driver/core/policies/DowngradingConsistencyRetryPolicy.html
>
> - Rahul
>
> On Mon, Jan 15, 2018 at 10:04 AM, Kyrylo Lebediev <
> kyrylo_lebed...@epam.com> wrote:
>
> Hi,
>
>
> Let's say we have a C* cluster with following parameters:
>
>  - 50 nodes in the cluster
>
>  - RF=3
>
>  - vnodes=256 per node
>
>  - CL for some queries = QUORUM
>
>  - endpoint_snitch = SimpleSnitch
>
>
> Is it correct that 2 any nodes down will cause unavailability of a
> keyrange at CL=QUORUM?
>
>
> Regards,
>
> Kyrill
>
>
>

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Too many tombstones using TTL

2018-01-12 Thread Alexander Dejanovski
Hi,

As DuyHai said, different TTLs could theoretically be set for different
cells of the same row. And one TTLed cell could be shadowing another cell
that has no TTL (say you forgot to set a TTL and set one afterwards by
performing an update), or vice versa.
One cell could also be missing from a node without Cassandra knowing. So
turning an incomplete row that only has expired cells into a tombstone row
could lead to wrong results being returned at read time : the tombstone row
could potentially shadow a valid live cell from another replica.

Cassandra needs to retain each TTLed cell and send it to replicas during
reads to cover all possible cases.
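
As a small sketch of how that happens in practice (reusing the test_ttl.items
table from the example quoted below; values and TTLs are made up):

cqlsh -e "INSERT INTO test_ttl.items(a, b, c1, c2, c3) VALUES ('AAA', 'BBB', 'C111', 'C222', 'C333') USING TTL 60;"
# later, refresh a single column with a different TTL
cqlsh -e "UPDATE test_ttl.items USING TTL 3600 SET c2 = 'C999' WHERE a = 'AAA' AND b = 'BBB';"
# c1 and c3 expire after 60s while c2 lives for an hour, so expiration
# has to be tracked (and eventually tombstoned) per cell, not per row.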


On Fri, Jan 12, 2018 at 5:28 PM Python_Max <python@gmail.com> wrote:

> Thank you for response.
>
> I know about the option of setting TTL per column or even per item in
> collection. However in my example entire row has expired, shouldn't
> Cassandra be able to detect this situation and spawn a single tombstone for
> entire row instead of many?
> Is there any reason not doing this except that no one needs it? Is this
> suitable for feature request or improvement?
>
> Thanks.
>
> On Wed, Jan 10, 2018 at 4:52 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>
>> "The question is why Cassandra creates a tombstone for every column
>> instead of single tombstone per row?"
>>
>> --> Simply because technically it is possible to set different TTL value
>> on each column of a CQL row
>>
>> On Wed, Jan 10, 2018 at 2:59 PM, Python_Max <python@gmail.com> wrote:
>>
>>> Hello, C* users and experts.
>>>
>>> I have (one more) question about tombstones.
>>>
>>> Consider the following example:
>>> cqlsh> create keyspace test_ttl with replication = {'class':
>>> 'SimpleStrategy', 'replication_factor': '1'}; use test_ttl;
>>> cqlsh> create table items(a text, b text, c1 text, c2 text, c3 text,
>>> primary key (a, b));
>>> cqlsh> insert into items(a,b,c1,c2,c3) values('AAA', 'BBB', 'C111',
>>> 'C222', 'C333') using ttl 60;
>>> bash$ nodetool flush
>>> bash$ sleep 60
>>> bash$ nodetool compact test_ttl items
>>> bash$ sstabledump mc-2-big-Data.db
>>>
>>> [
>>>   {
>>> "partition" : {
>>>   "key" : [ "AAA" ],
>>>   "position" : 0
>>> },
>>> "rows" : [
>>>   {
>>> "type" : "row",
>>> "position" : 58,
>>> "clustering" : [ "BBB" ],
>>> "liveness_info" : { "tstamp" : "2018-01-10T13:29:25.777Z", "ttl"
>>> : 60, "expires_at" : "2018-01-10T13:30:25Z", "expired" : true },
>>> "cells" : [
>>>   { "name" : "c1", "deletion_info" : { "local_delete_time" :
>>> "2018-01-10T13:29:25Z" }
>>>   },
>>>   { "name" : "c2", "deletion_info" : { "local_delete_time" :
>>> "2018-01-10T13:29:25Z" }
>>>   },
>>>   { "name" : "c3", "deletion_info" : { "local_delete_time" :
>>> "2018-01-10T13:29:25Z" }
>>>   }
>>> ]
>>>   }
>>> ]
>>>   }
>>> ]
>>>
>>> The question is why Cassandra creates a tombstone for every column
>>> instead of single tombstone per row?
>>>
>>> In production environment I have a table with ~30 columns and it gives
>>> me a warning for 30k tombstones and 300 live rows. It is 30 times more than
>>> it could be.
>>> Can this behavior be tuned in some way?
>>>
>>> Thanks.
>>>
>>> --
>>> Best regards,
>>> Python_Max.
>>>
>>
>>
>
>
> --
> Best regards,
> Python_Max.
>


-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Full repair caused disk space increase issue

2018-01-04 Thread Alexander Dejanovski
Hi Simon,

since Cassandra 2.2, anticompaction is performed in all types of repairs,
except subrange repair.
Given that you have some very big SSTables, the temporary space used by
anticompaction (which does the opposite of compaction : read one sstable,
output two sstables) will impact your disk usage while it's running. It
will reach a peak when they are close to completion.
The anticompaction that is reported by compactionstats is currently using
an extra 147GB*[compression ratio]. So with a compression ratio of 0.3 for
example, that would be 44GB that will get reclaimed shortly after the
anticompaction is over.

You can check the current overhead of compaction by listing temporary
sstables : *tmp*Data.db

It's also possible that you have some overstreaming that occurred during
your repair, which will increase the size on disk until it gets compacted
away (over time).
You should also check if you don't have snapshots sticking around by
running "nodetool listsnapshots".

Now, you're mentioning that you ran repair to evict tombstones. This is not
what repair does, and tombstones are evicted through compaction when they
meet the requirements (gc_grace_seconds and all the cells of the partition
involved in the same compaction).
If you want to optimize your tombstone eviction, especially with STCS, I
advise to turn on unchecked_tombstone_compaction, which will allow single
sstables compactions to be triggered by Cassandra when there is more than
20% of estimated droppable tombstones in an SSTable.
You can check your current droppable tombstone ratio by running
sstablemetadata on all your sstables.
A command like the following should do the trick (it will print out min/max
timestamps too) :

for f in *Data.db; do meta=$(sudo sstablemetadata $f); echo -e "Max:"
$(date --date=@$(echo "$meta" | grep Maximum\ time | cut -d" "  -f3| cut -c
1-10) '+%m/%d/%Y') "Min:" $(date --date=@$(echo "$meta" | grep Minimum\
time | cut -d" "  -f3| cut -c 1-10) '+%m/%d/%Y') $(echo "$meta" | grep
droppable) ' \t ' $(ls -lh $f | awk '{print $5" "$6" "$7" "$8" "$9}'); done
| sort

Check if the 20% threshold is high enough by verifying that newly created
SSTables don't already reach that level, and adjust accordingly if it's the
case (for example raise the threshold to 50%).

To activate the tombstone compactions, with a 50% droppable tombstone
threshold, perform the following statement on your table :

ALTER TABLE cargts.eventdata WITH compaction =
{'class':'SizeTieredCompactionStrategy',
'unchecked_tombstone_compaction':'true', 'tombstone_threshold':'0.5'}

Picking the right threshold is up to you.
Note that tombstone compactions running more often will use temporary space
as well, but they should help evicting tombstones faster if the partitions
are contained within a single SSTable.

If you are dealing with TTLed data and your partitions spread over time,
I'd strongly suggest considering TWCS instead of STCS which can remove
fully expired SSTables much more efficiently.
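
For reference, a minimal sketch of what switching that table to TWCS could
look like. Note that TWCS only ships with Cassandra 3.0.8+/3.8+ (on 2.2 it has
to be added as an external jar with its own class name), and the 1 day window
below is an arbitrary example that should be sized to give roughly 20-30
windows over your TTL:

cqlsh -e "ALTER TABLE cargts.eventdata WITH compaction = {'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'DAYS', 'compaction_window_size': '1'};"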

Cheers,


On Fri, Jan 5, 2018 at 7:43 AM wxn...@zjqunshuo.com <wxn...@zjqunshuo.com>
wrote:

> Hi All,
> In order to evict tombstones, I issued full repair with the command
> "nodetool -pr -full". Then the data load size was indeed decreased by 100G
> for each node by using "nodetool status" to check. But the actual disk
> usage increased by 500G for each node. The repair is still ongoing and
> leaving less and less disk space for me.
>
> From compactionstats, I see "Anticompaction after repair". Based on my
> understanding, it is for incremental repair by changing sstable metadata to
> indicate which file is repaired, so in next repair it is not going to be
> repaired. But I'm doing full repair, Why Anticompaction?
>
> 9e09c490-f1be-11e7-b2ea-b3085f85ccae   Anticompaction after repair   cargts
>   eventdata   147.3 GB   158.54 GB   bytes   92.91%
>
> There are paired sstable files. I mean they have the same timestamp, as
> shown below. I guess one or both of them should be deleted during
> repair, but for some unknown reason the repair process failed to delete
> them.
> -rw-r--r-- 1 root root 237G Dec 31 12:48 lb-123800-big-Data.db
> -rw-r--r-- 1 root root 243G Dec 31 12:48 lb-123801-big-Data.db
>
> C* version is 2.2.8 with STCS. Any ideas?
>
> Cheers,
> -Simon
>


-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Deleted data comes back on node decommission

2017-12-15 Thread Alexander Dejanovski
Hi Max,

I don't know if it's related to your issue but on a side note, if you
decide to use Reaper (and use full repairs, not incremental ones), but mix
that with "nodetool repair", you'll end up with 2 pools of SSTables that
cannot get compacted together.
Reaper uses subrange repair which doesn't mark SSTables as repaired (no
anticompaction is performed, repairedAt remains at 0), while using nodetool
in full and incremental modes will perform anticompaction.

SSTables with repairedAt > 0 cannot be compacted with SSTables with
repairedAt = 0.

Bottom line is that if you want your SSTables to be compacted together
naturally, you have to run repairs either exclusively through Reaper or
exclusively through nodetool.
If you decide to use Reaper exclusively, you have to revert the repairedAt
value to 0 for all sstables on all nodes, using sstablerepairedset
<https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableRepairedSet.html>
.
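
A minimal sketch of that reset, node by node (the data path and service name
are assumptions, and Cassandra must be stopped while sstablerepairedset, which
ships in Cassandra's tools/bin, is running):

nodetool drain && sudo service cassandra stop
find /var/lib/cassandra/data/my_keyspace -name '*-Data.db' > /tmp/sstables.txt
sstablerepairedset --really-set --is-unrepaired -f /tmp/sstables.txt
sudo service cassandra start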

Cheers,

On Fri, Dec 15, 2017 at 4:57 PM Jeff Jirsa <jji...@gmail.com> wrote:

> The generation (integer id in file names) doesn’t matter for ordering like
> this
>
> It matters in schema tables for addition of new columns/types, but it’s
> irrelevant for normal tables - you could do a user defined compaction on
> 31384 right now and it’d be rewritten as-is (minus purgable data) with the
> new highest generation, even though it’s all old data.
>
>
> --
> Jeff Jirsa
>
>
> On Dec 15, 2017, at 6:55 AM, Python_Max <python@gmail.com> wrote:
>
> Hi, Kurt.
>
> Thank you for response.
>
>
> Repairs are marked as 'done' without errors in reaper history.
>
> Example of 'wrong order':
>
> * file mc-31384-big-Data.db contains tombstone:
>
> {
> "type" : "row",
> "position" : 7782,
> "clustering" : [ "9adab970-b46d-11e7-a5cd-a1ba8cfc1426" ],
> "deletion_info" : { "marked_deleted" :
> "2017-10-28T04:51:20.589394Z", "local_delete_time" : "2017-10-28T04:51:20Z"
> },
> "cells" : [ ]
>   }
>
> * file mc-31389-big-Data.db contains data:
>
> {
> "type" : "row",
> "position" : 81317,
> "clustering" : [ "9adab970-b46d-11e7-a5cd-a1ba8cfc1426" ],
> "liveness_info" : { "tstamp" : "2017-10-19T01:34:10.055389Z" },
> "cells" : [...]
>   }
> Index 31384 is less than 31389 but I'm not sure whether it matters at all.
>
> I assume that data and tombstones are not compacting due to another reason:
> the tokens are not owned by that node anymore and the only way to purge
> such keys is 'nodetool cleanup', isn't it?
>
>
> On 14.12.17 16:14, kurt greaves wrote:
>
> Are you positive your repairs are completing successfully? Can you send
> through an example of the data in the wrong order? What you're saying
> certainly shouldn't happen, but there's a lot of room for mistakes.
>
> On 14 Dec. 2017 20:13, "Python_Max" <python@gmail.com> wrote:
>
>> Thank you for reply.
>>
>> No, I did not execute 'nodetool cleanup'. Documentation
>> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRemoveNode.html
>> does not mention that cleanup is required.
>>
>> Do you think that extra data which the node is not responsible for can lead to
>> zombie data?
>>
>>
>> On 13.12.17 18:43, Jeff Jirsa wrote:
>>
>>> Did you run cleanup before you shrank the cluster?
>>>
>>>
>> --
>>
>> Best Regards,
>> Python_Max.
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> --
>
> Best Regards,
> Python_Max.
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Huge system.batches table after joining a node (Cassandra 3.11.1)

2017-12-07 Thread Alexander Dejanovski
Just a heads up that (in case you missed it) MVs were retroactively marked
as experimental and that a large part of the community considers they
should not be used in production.

On Thu, Dec 7, 2017 at 4:53 PM Alexander Dejanovski <a...@thelastpickle.com>
wrote:

> Yes, MVs use batches during bootstraps and decommissions.
>
> You can read more about it here :
> https://issues.apache.org/jira/browse/CASSANDRA-13065
> and here : https://issues.apache.org/jira/browse/CASSANDRA-13614
>
> Things will improve in 4.0 only it seems.
>
> On Thu, Dec 7, 2017 at 4:31 PM Christian Lorenz <
> christian.lor...@webtrekk.com> wrote:
>
>> Hi Alexander,
>>
>>
>>
>> yes we use MV's. The size of the batch table is around 10GB on the
>> existing nodes. Also seems pretty high.
>>
>> So is this table (also) used to process MV building?
>>
>>
>>
>> Regards,
>>
>> Christian
>>
>> *Von: *Alexander Dejanovski <a...@thelastpickle.com>
>> *Antworten an: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Datum: *Donnerstag, 7. Dezember 2017 um 16:24
>> *An: *"user@cassandra.apache.org" <user@cassandra.apache.org>
>> *Betreff: *Re: Huge system.batches table after joining a node (Cassandra
>> 3.11.1)
>>
>>
>>
>> Hi Christian,
>>
>>
>>
>> it is probably not safe to drop it because it contains all logged batches
>> that are supposed to be played on the cluster.
>>
>> The size of the batches table should go down as they get processed
>> (although 100GB is a pretty huge batch log...)
>>
>>
>>
>> Do you use Materialized Views in your data model ?
>>
>> You just bootstrapped a new node and the table grew on all other nodes ?
>>
>>
>>
>> On Thu, Dec 7, 2017 at 12:25 PM Christian Lorenz <
>> christian.lor...@webtrekk.com> wrote:
>>
>> Hi,
>>
>>
>>
>> after joining a node into an existing cluster, the table system.batches
> became quite large (100GB), which is about 1/3 of the node's size.
>>
>> Is it safe to truncate the table?
>>
>>
>>
>> Regards,
>>
>> Christian
>>
>>
>>
>> --
>>
>> -
>>
>> Alexander Dejanovski
>>
>> France
>>
>> @alexanderdeja
>>
>>
>>
>> Consultant
>>
>> Apache Cassandra Consulting
>>
>> http://www.thelastpickle.com
>>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Huge system.batches table after joining a node (Cassandra 3.11.1)

2017-12-07 Thread Alexander Dejanovski
Yes, MVs use batches during bootstraps and decommissions.

You can read more about it here :
https://issues.apache.org/jira/browse/CASSANDRA-13065
and here : https://issues.apache.org/jira/browse/CASSANDRA-13614

Things will improve in 4.0 only it seems.

On Thu, Dec 7, 2017 at 4:31 PM Christian Lorenz <
christian.lor...@webtrekk.com> wrote:

> Hi Alexander,
>
>
>
> yes we use MV's. The size of the batch table is around 10GB on the
> existing nodes. Also seems pretty high.
>
> So is this table (also) used to process MV building?
>
>
>
> Regards,
>
> Christian
>
> *Von: *Alexander Dejanovski <a...@thelastpickle.com>
> *Antworten an: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Datum: *Donnerstag, 7. Dezember 2017 um 16:24
> *An: *"user@cassandra.apache.org" <user@cassandra.apache.org>
> *Betreff: *Re: Huge system.batches table after joining a node (Cassandra
> 3.11.1)
>
>
>
> Hi Christian,
>
>
>
> it is probably not safe to drop it because it contains all logged batches
> that are supposed to be played on the cluster.
>
> The size of the batches table should go down as they get processed
> (although 100GB is a pretty huge batch log...)
>
>
>
> Do you use Materialized Views in your data model ?
>
> You just bootstrapped a new node and the table grew on all other nodes ?
>
>
>
> On Thu, Dec 7, 2017 at 12:25 PM Christian Lorenz <
> christian.lor...@webtrekk.com> wrote:
>
> Hi,
>
>
>
> after joining a node into an existing cluster, the table system.batches
> became quite large (100GB), which is about 1/3 of the node's size.
>
> Is it safe to truncate the table?
>
>
>
> Regards,
>
> Christian
>
>
>
> --
>
> -
>
> Alexander Dejanovski
>
> France
>
> @alexanderdeja
>
>
>
> Consultant
>
> Apache Cassandra Consulting
>
> http://www.thelastpickle.com
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Huge system.batches table after joining a node (Cassandra 3.11.1)

2017-12-07 Thread Alexander Dejanovski
Hi Christian,

it is probably not safe to drop it because it contains all logged batches
that are supposed to be played on the cluster.
The size of the batches table should go down as they get processed
(although 100GB is a pretty huge batch log...)
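
To keep an eye on whether it actually shrinks, a quick sketch (the flush is
only there so that the size on disk reflects what is still pending):

nodetool flush system batches
nodetool tablestats system.batches | grep -E 'Space used|Number of partitions'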

Do you use Materialized Views in your data model ?
You just bootstrapped a new node and the table grew on all other nodes ?

On Thu, Dec 7, 2017 at 12:25 PM Christian Lorenz <
christian.lor...@webtrekk.com> wrote:

> Hi,
>
>
>
> after joining a node into an existing cluster, the table system.batches
> became quite large (100GB), which is about 1/3 of the node's size.
>
> Is it safe to truncate the table?
>
>
>
> Regards,
>
> Christian
>
>
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: DC aware failover

2017-11-16 Thread Alexander Dejanovski
Hi Anil,

yes, that's the one in use there.
I should probably merge it into master to avoid confusion.

Cheers,

On Fri, Nov 17, 2017 at 6:12 AM CPC <acha...@gmail.com> wrote:

> Hi Alex,
>
> Is lost-token-range detection impl finished? Since this feature is more
> appealing I want to test it.
>
> Thank you for your help
>
>
> On Nov 16, 2017 10:35 AM, "Alexander Dejanovski" <a...@thelastpickle.com>
> wrote:
>
> Hi,
>
> The policy is used in production at least in my former company.
>
> I can help if you have issues using it.
>
> Cheers,
>
> Le jeu. 16 nov. 2017 à 08:32, CPC <acha...@gmail.com> a écrit :
>
>> Hi,
>>
>> We want to implement DC aware failover policy. For example if application
>> could not reach some part of the ring or if we lose 50% of the local DC then
>> we want our application to automatically switch to the other DC. We found this
>> project on GitHub
>> https://github.com/adejanovski/cassandra-dcaware-failover but we don't
>> know whether it is stable and used in production. Do you know about this
>> project or do you know other projects that provide same kind of
>> functionality.
>>
>> Thanks...
>>
> --
> -----
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: DC aware failover

2017-11-15 Thread Alexander Dejanovski
Hi,

The policy is used in production at least in my former company.

I can help if you have issues using it.

Cheers,

Le jeu. 16 nov. 2017 à 08:32, CPC <acha...@gmail.com> a écrit :

> Hi,
>
> We want to implement DC aware failover policy. For example if application
> could not reach some part of the ring or if we lose 50% of the local DC then
> we want our application to automatically switch to the other DC. We found this
> project on GitHub
> https://github.com/adejanovski/cassandra-dcaware-failover but we don't
> know whether it is stable and used in production. Do you know about this
> project or do you know other projects that provide same kind of
> functionality.
>
> Thanks...
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: High IO Util using TimeWindowCompaction

2017-11-15 Thread Alexander Dejanovski
Hi Kurt,

it seems highly unlikely that TWCS is responsible for your problems since
you're throttling compaction way below what i3 instances can provide.
For such instances, we would advise to use 8 concurrent compactors with
high compaction throughput (>200MB/s, if not unthrottled).

We've had reports and observed some inconsistent I/O behaviors with some i3
instances (not much lately though), so it could be what's biting you.
It would be helpful to provide a bit more info here to troubleshoot this :

   - The output of the following command during one of the 100% util phase
   : iostat -dmx 2 50
   - The output of : nodetool tablehistograms prod_dedupe event_hashes
   - The output of the following command during one of the 100% util phase
   : nodetool compactionstats -H
   - The output of : nodetool tpstats


Since you have very tiny partitions, we would advise to lower or disable
readahead, but you're not performing reads anyway on that cluster.
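
For reference, a minimal sketch of checking and lowering readahead (the device
name is an assumption; on your RAID0 setup it would be the md device):

sudo blockdev --getra /dev/md0        # current readahead, in 512-byte sectors
sudo blockdev --setra 8 /dev/md0      # lower it to 4KB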

It would be good to check how 3.11 with TWCS performs on the same hardware
as the 3.7 cluster (c3.4xl) to narrow down the suspect list. Any chance you
can test this ?
Also, which OS are you using on the i3 instances ?

Thanks



On Mon, Nov 13, 2017 at 11:51 PM Kurtis Norwood <k...@amplitude.com> wrote:

> I've been testing out cassandra 3.11 (currently using 3.7) and have been
> observing really high io util occasionally that sometimes results in
> temporary flatlining at 100% io util for an extended period. I think my use
> case is pretty simple and currently only testing part of it on this new
> version so looking for advice on what might be going wrong.
>
> Use Case: I am using cassandra as basically a large "set", my table schema
> is incredibly simple, just a primary key. Records are all written with the
> same TTL (7 days). Only queries are inserting a key (which we expect to
> only happen once) and checking whether that key exists in the table. In my
> 3.7 cluster I am using DateTieredCompaction and running on c3.4xlarge (x30)
> in AWS. I've been experimenting with i3.4xlarge and wanted to also try
> TimeWindowCompaction to see if we could get better performance when adding
> machines to the cluster, that was always a really painful experience in 3.7
> with DateTieredCompaction and the docs say TimeWindowCompaction is ideal
> for my use case.
>
> Right now I am running a new cluster with 3.11 and TimeWindowCompaction
> alongside the old cluster and doing writes to both. Only reads go to the
> old cluster while I go through this preliminary testing. So the 3.11
> cluster receives between 90K to 150K writes/second and no reads. This
> morning for a period of about 30 minutes the cluster was at 100% ioutil and
> eventually recovered from this state. At that time it was only receiving
> ~100K writes/second. I don't see anything interesting in the logs that
> indicate what is going on, and I don't think a sudden compaction is the
> issue since I have limits on compaction throughput.
>
> Staying on 3.7 would be a major bummer so looking for advice.
>
> Some information that might be useful:
>
> compaction throughput - 16MB/s
> concurrent compactors - 4
> machine type - i3.4xlarge (x20)
> disk - RAID0 across 2 NVMe SSDs
>
> Table Schema looks like this:
>
> CREATE TABLE prod_dedupe.event_hashes (
>
> app int,
>
> hash_value blob,
>
> PRIMARY KEY ((app, hash_value))
>
> ) WITH bloom_filter_fp_chance = 0.01
>
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>
> AND comment = 'For deduping'
>
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_size': '4', 'compaction_window_unit': 'HOURS',
> 'max_threshold': '64', 'min_threshold': '4'}
>
> AND compression = {'chunk_length_in_kb': '4', 'class': '
> org.apache.cassandra.io.compress.LZ4Compressor'}
>
> AND crc_check_chance = 1.0
>
> AND dclocal_read_repair_chance = 0.0
>
> AND default_time_to_live = 0
>
> AND gc_grace_seconds = 3600
>
> AND max_index_interval = 2048
>
> AND memtable_flush_period_in_ms = 0
>
> AND min_index_interval = 128
>
> AND read_repair_chance = 0.0
>
> AND speculative_retry = 'NONE';
>
>
> Thanks,
> Kurt
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: STCS leaving sstables behind

2017-11-13 Thread Alexander Dejanovski
And actually, full repair with 3.0/3.x would have the same effect
(anticompaction) unless you're using subrange repair.
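
A quick sketch to check whether that is what happened here, by looking at the
repairedAt flag of every SSTable (the data path below is an assumption):

for f in /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db; do
  echo "$f: $(sstablemetadata "$f" | grep 'Repaired at')"
done

Anything showing "Repaired at" > 0 sits in the repaired pool and won't be
compacted with the unrepaired SSTables.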

On Mon, Nov 13, 2017 at 3:28 PM Jeff Jirsa <jji...@gmail.com> wrote:

> Running incremental repair puts sstables into a “repaired” set (and an
> unrepaired set), which results in something similar to what you’re
> describing.
>
> Were you running / did you run incremental repair ?
>
>
> --
> Jeff Jirsa
>
>
> On Nov 13, 2017, at 5:04 AM, Nicolas Guyomar <nicolas.guyo...@gmail.com>
> wrote:
>
> Hi everyone,
>
> I'm facing quite a strange behavior on STCS on 3.0.13, the strategy seems
> to have "forgotten" about old sstables, and started a completely new cycle
> from scratch, leaving the old sstables on disk untouched :
>
> Something happened on Nov 10 on every node, which resulted in all those
> sstables left behind :
>
> -rw-r--r--.  8 cassandra cassandra   15G Nov  9 22:22 mc-4828-big-Data.db
> -rw-r--r--.  8 cassandra cassandra  4.8G Nov 10 01:39 mc-4955-big-Data.db
> -rw-r--r--.  8 cassandra cassandra  2.4G Nov 10 01:45 mc-4957-big-Data.db
> -rw-r--r--.  8 cassandra cassandra  662M Nov 10 01:47 mc-4959-big-Data.db
> -rw-r--r--.  8 cassandra cassandra  2.8G Nov 10 03:46 mc-5099-big-Data.db
> -rw-r--r--.  8 cassandra cassandra  4.6G Nov 10 03:58 mc-5121-big-Data.db
> -rw-r--r--.  7 cassandra cassandra   53M Nov 10 08:45 mc-5447-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  219M Nov 10 08:46 mc-5454-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  650M Nov 10 08:46 mc-5452-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  1.2G Nov 10 08:48 mc-5458-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  1.5G Nov 10 08:50 mc-5465-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  504M Nov 10 09:39 mc-5526-big-Data.db
> -rw-r--r--.  7 cassandra cassandra   57M Nov 10 09:40 mc-5527-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  101M Nov 10 09:41 mc-5532-big-Data.db
> -rw-r--r--.  7 cassandra cassandra   86M Nov 10 09:41 mc-5533-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  134M Nov 10 09:42 mc-5537-big-Data.db
> -rw-r--r--.  7 cassandra cassandra  3.9G Nov 10 09:54 mc-5538-big-Data.db
> *-rw-r--r--.  7 cassandra cassandra  1.3G Nov 10 09:57 mc-5548-big-Data.db*
> -rw-r--r--.  6 cassandra cassandra   16G Nov 11 01:23 mc-6474-big-Data.db
> -rw-r--r--.  4 cassandra cassandra   17G Nov 12 06:44 mc-7898-big-Data.db
> -rw-r--r--.  3 cassandra cassandra  8.2G Nov 12 13:45 mc-8226-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  6.8G Nov 12 22:38 mc-8581-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  6.1G Nov 13 03:10 mc-8937-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  3.1G Nov 13 04:12 mc-9019-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  3.0G Nov 13 05:56 mc-9112-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  1.2G Nov 13 06:14 mc-9138-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  1.1G Nov 13 06:27 mc-9159-big-Data.db
> -rw-r--r--.  2 cassandra cassandra  1.2G Nov 13 06:46 mc-9182-big-Data.db
> -rw-r--r--.  1 cassandra cassandra  1.9G Nov 13 07:18 mc-9202-big-Data.db
> -rw-r--r--.  1 cassandra cassandra  353M Nov 13 07:22 mc-9207-big-Data.db
> -rw-r--r--.  1 cassandra cassandra  120M Nov 13 07:22 mc-9208-big-Data.db
> -rw-r--r--.  1 cassandra cassandra  100M Nov 13 07:23 mc-9209-big-Data.db
> -rw-r--r--.  1 cassandra cassandra   67M Nov 13 07:25 mc-9210-big-Data.db
> -rw-r--r--.  1 cassandra cassandra   51M Nov 13 07:25 mc-9211-big-Data.db
> -rw-r--r--.  1 cassandra cassandra   73M Nov 13 07:27 mc-9212-big-Data.db
>
>
> TRACE logs for the Compaction Manager show that sstables before Nov 10
> are grouped in different buckets than the ones after Nov 10.
>
> At first I thought of some coldness behavior that would filter those
> "old" sstables, but looking at the code
> https://github.com/apache/cassandra/blob/cassandra-3.0/src/java/org/apache/cassandra/db/compaction/SizeTieredCompactionStrategy.java#L237
> I don't see any coldness or time pattern used to create buckets.
>
> I tried restarting the node but the buckets are still grouping in 2
> "groups" splitted around Nov 10
>
> I may have missed something in the logs, but they are clear of errors/warnings
> around that Nov 10 time
>
> For what it's worth, restarting the node stopped nodetool status from
> reporting a wrong Load (nearly 2TB per node instead of 300GB) => we are
> loading some data for a week now, it seems that this can happen sometimes
>
> If anyone ever experienced that kind of behavior I'd be glad to know
> whether it is OK or not, I'd like to avoid manually triggering JMX
> UserDefinedCompaction ;)
>
> Thank you
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Not marking node down due to local pause

2017-10-20 Thread Alexander Dejanovski
Hi John,

the other main source of STW pause in the JVM is the safepoint mechanism :
http://blog.ragozin.info/2012/10/safepoints-in-hotspot-jvm.html

If you turn on full GC logging in your cassandra-env.sh file, you will find
lines like this :

2017-10-09T20:13:42.462+0000: 4.890: Total time for which application
threads were stopped: 0.0003137 seconds, Stopping threads took: 0.0001163
seconds
2017-10-09T20:13:42.472+0000: 4.899: Total time for which application
threads were stopped: 0.0001622 seconds, Stopping threads took: 0.361
seconds
2017-10-09T20:13:46.162+0000: 8.590: Total time for which application
threads were stopped: 2.6899536 seconds, Stopping threads took: 2.6899004
seconds
2017-10-09T20:13:46.162+0000: 8.590: Total time for which application
threads were stopped: 0.0002418 seconds, Stopping threads took: 0.456
seconds
2017-10-09T20:13:46.461+0000: 8.889: Total time for which application
threads were stopped: 0.0002654 seconds, Stopping threads took: 0.397
seconds
2017-10-09T20:13:46.478+0000: 8.906: Total time for which application
threads were stopped: 0.0001646 seconds, Stopping threads took: 0.791
seconds

These aren't GCs but still you can see that we have a 2.6s pause, with most
of the time spent waiting for threads to reach the safepoint.
When we saw this in the past, it was due to faulty disks that were
preventing the read threads from reaching the safepoint.
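
With that logging enabled, a quick sketch for spotting the long stops (the log
path is an assumption, and the awk field index matches the "Total time for
which application threads were stopped" line format shown above):

grep 'Total time for which application threads were stopped' /var/log/cassandra/gc.log* \
  | awk '$11 + 0 > 1.0'    # only print stops longer than 1 second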

If you want to specifically identify the threads that were stuck, you can
set a timeout on the safepoints :

# GC logging options
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -XX:+PrintPromotionFailure"
JVM_OPTS="$JVM_OPTS -XX:+PrintSafepointStatistics"
JVM_OPTS="$JVM_OPTS -XX:+UnlockDiagnosticVMOptions -XX:+LogVMOutput
-XX:LogFile=/var/log/cassandra/vm.log"
JVM_OPTS="$JVM_OPTS -XX:+SafepointTimeout -XX:SafepointTimeoutDelay=5000"



Check the duration of the pauses you're seeing on your nodes and set a
shorter timeout (it should be fairly fast to reach safepoint). Above it is
set at 5s.
Restart your Cassandra process with the above settings, and wait for one
pause to happen. Then stop Cassandra and it will output information in
the /var/log/cassandra/vm.log file (that only happens when the process
stops, nothing gets written there before that).

If indeed some threads were preventing the safepoint, they'll get listed
there.

Let us know how it goes.

Cheers,


On Fri, Oct 20, 2017 at 5:11 AM John Sanda <john.sa...@gmail.com> wrote:

> I have a small, two-node cluster running Cassandra 2.2.1. I am seeing a
> lot of these messages in both logs:
>
> WARN  07:23:16 Not marking nodes down due to local pause of 7219277694 >
> 5000000000
>
> I am fairly certain that they are not due to GC. I am not seeing a whole lot
> of GC being logged and nothing over 500 ms. I do think it is I/O related.
>
> I am seeing lots of read timeouts for queries to a table that has a large
> growing number of SSTables. At last count there are over 1800 SSTables on
> one node. The count is lower on the other node, and I suspect that this is
> due to data distribution. Slowly but surely the number of SSTables keeps
> going up, and not surprisingly nodetool tablehistograms reports high
> latencies. The table is using STCS.
>
> I am seeing some but not a whole lot of dropped mutations. nodetool
> tpstats looks ok.
>
> The growing number of SSTables really makes me think this is an I/O issue.
> Cassandra is running in a kubernetes cluster using a SAN which is another
> reason I suspect I/O.
>
> What are some things I can look at/test to determine what is causing all
> of these local pauses?
>
>
> - John
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: GC/CPU increase after upgrading to 3.0.14 (from 2.1.18)

2017-09-26 Thread Alexander Dejanovski
14 on our own, thus, any self-made configuration changes
>> (e.g. new gen heap size) for 2.1.18 are also in place with 3.0.14.
>>
>>
>>
>> What we see after a time-frame of ~ 7 days (so, e.g. should not be caused
>> by some sort of spiky compaction pattern) is an AVG increase in GC/CPU
>> (most likely correlating):
>>
>> · CPU: ~ 12% => ~ 17%
>>
>> · GC Suspension: ~ 1,7% => 3,29%
>>
>>
>>
>> In this environment not a big deal, but relatively we have a CPU increase
>> of ~ 50% (with increased GC most likely contributing). Something we have to
>> deal with when going into production (going into larger, multi-node
>> loadtest environments first though).
>>
>>
>>
>> Besides the CPU/GC shift, we also monitor the following noticeable changes
>> (don't know if they somehow correlate with the CPU/GC shift above):
>>
>> · Increased AVG Write Client Requests Latency (95th Percentile),
>> org.apache.cassandra.metrics.ClientRequest.Latency.Write: 6,05ms => 29,2ms,
>> but almost constant (no change in) write client request latency for our
>> particular keyspace only,
>> org.apache.cassandra.metrics.Keyspace.ruxitdb.WriteLatency
>>
>> · Compression metadata memory usage drop,
>> org.apache.cassandra.metrics.Keyspace.XXX.
>> CompressionMetadataOffHeapMemoryUsed: ~218MB => ~105MB => Good or bad?
>> Known?
>>
>>
>>
>> I know, looks all a bit vague, but perhaps someone else has seen
>> something similar when upgrading to 3.0.14 and can share their
>> thoughts/ideas. Especially the (relative) CPU/GC increase is something we
>> are curious about.
>>
>>
>>
>> Thanks a lot.
>>
>>
>>
>> Thomas
>>
>> The contents of this e-mail are intended for the named addressee only. It
>> contains information that may be confidential. Unless you are the named
>> addressee or an authorized designee, you may not copy or use it, or
>> disclose it to anyone else. If you received it in error please notify us
>> immediately and then destroy it. Dynatrace Austria GmbH (registration
>> number FN 91482h) is a company registered in Linz whose registered office
>> is at 4040 Linz, Austria, Freistädterstraße 313
>>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-18 Thread Alexander Dejanovski
You could dig a bit more in the logs to see what precisely failed.
I suspect anticompaction to still be responsible for conflicts with
validation compaction (so you should see validation failures on some nodes).

The only way to fully disable anticompaction will be to run subrange
repairs.
The two easy solutions for that will be cassandra_range_repair
<https://github.com/BrianGallew/cassandra_range_repair> and reaper
<http://cassandra-reaper.io/>.

Reaper will offer better orchestration as it considers the whole token ring
and not just a single node at a time, and already includes a scheduler. It
also checks for pending compactions and slows down repairs if there are too
many (ie : your repair job may be generating a lot of new sstables which
can put you in a very very bad place...).
That said, you may find that cassandra_range_repair better suits your
scheduling/running habits.

Cheers,

On Mon, Sep 18, 2017 at 10:11 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hi Alex,
>
>
>
> I now ran nodetool repair --full -pr keyspace cfs on all nodes in parallel
> and this may pop up now:
>
>
>
> 0.176.38.128 (progress: 1%)
>
> [2017-09-18 07:59:17,145] Some repair failed
>
> [2017-09-18 07:59:17,151] Repair command #3 finished in 0 seconds
>
> error: Repair job has failed with the error message: [2017-09-18
> 07:59:17,145] Some repair failed
>
> -- StackTrace --
>
> java.lang.RuntimeException: Repair job has failed with the error message:
> [2017-09-18 07:59:17,145] Some repair failed
>
> at
> org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:115)
>
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>
> at
> com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
>
>
> 2017-09-18 07:59:17 repair finished
>
>
>
>
>
> If running the above nodetool call sequentially on all nodes, repair
> finishes without printing a stack trace.
>
>
>
> The error message and stack trace aren't really useful here. Any further
> ideas/experiences?
>
>
>
> Thanks,
>
> Thomas
>
>
>
> *From:* Alexander Dejanovski [mailto:a...@thelastpickle.com]
> *Sent:* Freitag, 15. September 2017 11:30
>
>
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multi-node repair fails after upgrading to 3.0.14
>
>
>
> Right, you should indeed add the "--full" flag to perform full repairs,
> and you can then keep the "-pr" flag.
>
>
>
> I'd advise to monitor the status of your SSTables as you'll probably end
> up with a pool of SSTables marked as repaired, and another pool marked as
> unrepaired which won't be compacted together (hence the suggestion of
> running subrange repairs).
>
> Use sstablemetadata to check on the "Repaired at" value for each. 0 means
> unrepaired and any other value (a timestamp) means the SSTable has been
> repaired.
>
> I've had behaviors in the past where running "-pr" on the whole cluster
> would still not mark all SSTables as repaired, but I can't say if that
> behavior has changed in latest versions.
>
>
>
> Having separate pools of SStables that cannot be compacted means that you
> might have tombstones that don't get evicted due to partitions living in
> both states (repaired/unrepaired).
>
>
>
> To sum up the recommendations :
>
> - Run a full repair with both "--full" and "-pr" and check that SSTables
> are properly marked as repaired
>
> - Use a tight repair schedule to avoid keeping partitions for too long in
> both repaired and unrepaired state
>
> - Switch to subrange repair if you want to fully avoid marking SSTables as
> repaired (which you don't need anyway since you're not using incremental
> repairs). If you wish to do this, you'll have to mark back all your
> sstables to unrepaired, using nodetool sstablerepairedset
> <https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableRepairedSet.html>
> .
>
>
>
> Cheers,
>
>
>
> On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hi Alex,
>
>
>
> thanks a lot. Somehow missed that incremental repairs are the default now.

Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-15 Thread Alexander Dejanovski
Right, you should indeed add the "--full" flag to perform full repairs, and
you can then keep the "-pr" flag.

I'd advise to monitor the status of your SSTables as you'll probably end up
with a pool of SSTables marked as repaired, and another pool marked as
unrepaired which won't be compacted together (hence the suggestion of
running subrange repairs).
Use sstablemetadata to check on the "Repaired at" value for each. 0 means
unrepaired and any other value (a timestamp) means the SSTable has been
repaired.
I've had behaviors in the past where running "-pr" on the whole cluster
would still not mark all SSTables as repaired, but I can't say if that
behavior has changed in latest versions.

Having separate pools of SStables that cannot be compacted means that you
might have tombstones that don't get evicted due to partitions living in
both states (repaired/unrepaired).

To sum up the recommendations :
- Run a full repair with both "--full" and "-pr" and check that SSTables
are properly marked as repaired
- Use a tight repair schedule to avoid keeping partitions for too long in
both repaired and unrepaired state
- Switch to subrange repair if you want to fully avoid marking SSTables as
repaired (which you don't need anyway since you're not using incremental
repairs). If you wish to do this, you'll have to mark back all your
sstables to unrepaired, using nodetool sstablerepairedset
<https://docs.datastax.com/en/cassandra/2.1/cassandra/tools/toolsSSTableRepairedSet.html>
.

Cheers,

On Fri, Sep 15, 2017 at 10:27 AM Steinmaurer, Thomas <
thomas.steinmau...@dynatrace.com> wrote:

> Hi Alex,
>
>
>
> thanks a lot. Somehow missed that incremental repairs are the default now.
>
>
>
> We have been happy with full repair so far, cause the data we currently
> manually invoke repair for is small (~1GB or even smaller).
>
>
>
> So I guess with full repairs across all nodes, we still can stick with the
> partition range (-pr) option, but with 3.0 we additionally have to provide
> the --full option, right?
>
>
>
> Thanks again,
>
> Thomas
>
>
>
> *From:* Alexander Dejanovski [mailto:a...@thelastpickle.com]
> *Sent:* Freitag, 15. September 2017 09:45
> *To:* user@cassandra.apache.org
> *Subject:* Re: Multi-node repair fails after upgrading to 3.0.14
>
>
>
> Hi Thomas,
>
>
>
> in 2.1.18, the default repair mode was full repair while since 2.2 it is
> incremental repair.
>
> So running "nodetool repair -pr" since your upgrade to 3.0.14 doesn't
> trigger the same operation.
>
>
>
> Incremental repair cannot run on more than one node at a time on a
> cluster, because you risk to have conflicts with sessions trying to
> anticompact and run validation compactions on the same SSTables (which will
> make the validation phase fail, like your logs are showing).
>
> Furthermore, you should never use "-pr" with incremental repair because it
> is useless in that mode, and won't properly perform anticompaction on all
> nodes.
>
>
>
> If you were happy with full repairs in 2.1.18, I'd suggest to stick with
> those in 3.0.14 as well because there are still too many caveats with
> incremental repairs that should hopefully be fixed in 4.0+.
>
> Note that full repair will also trigger anticompaction and mark SSTables
> as repaired in your release of Cassandra, and only full subrange repairs
> are the only flavor that will skip anticompaction.
>
>
>
> You will need some tooling to help with subrange repairs though, and I'd
> recommend to use Reaper which handles automation for you :
> http://cassandra-reaper.io/
>
>
>
> If you decide to stick with incremental repairs, first perform a rolling
> restart of your cluster to make sure no repair session still runs, and run
> "nodetool repair" on a single node at a time. Move on to the next node only
> when nodetool or the logs show that repair is over (which will include the
> anticompaction phase).
>
>
>
> Cheers,
>
>
>
>
>
>
>
> On Fri, Sep 15, 2017 at 8:42 AM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
> Hello,
>
>
>
> we are currently in the process of upgrading from 2.1.18 to 3.0.14. After
> upgrading a few test environments, we start to see some suspicious log
> entries regarding repair issues.
>
>
>
> We have a cron job on all nodes basically executing the following repair
> call on a daily basis:
>
>
>
> nodetool repair -pr 
>
>
>
> This gets started on all nodes at the same time. While this has worked
> with 2.1.18 (at least we haven't seen anything suspicious in Cassandra
> log), with 3.0.14 we get something similar

Re: Multi-node repair fails after upgrading to 3.0.14

2017-09-15 Thread Alexander Dejanovski
44206601,8149858096109302285],
> (3975126143101303723,3980729378827590597],
> (-956691623200349709,-946602525018301692],
> (-82499927325251331,-79866884352549492],
> (3952144214544622998,3955602392726495936],
> (8154760186218662205,8157079055586089583],
> (3840595196718778916,3866458971850198755],
> (-1066905024007783341,-1055954824488508260],
> (-7252356975874511782,-7246799942440641887],
> (-810612946397276081,-792189809286829222],
> (4964519403172053705,4970446606512414858],
> (-5380038118840759647,-5370953856683870319],
> (-3221630728515706463,-3206856875356976885],
> (-1193448110686154165,-1161640137086921883],
> (-3356304907368646189,-3346460884208327912],
> (3466596314109623830,346814432669172],
> (-9050241313548454460,-9005441616028750657],
> (402227699082311580,407458511300218383]]] Validation failed in /FAKE.33.64
>
> at
> org.apache.cassandra.repair.ValidationTask.treesReceived(ValidationTask.java:68)
> ~[apache-cassandra-3.0.14.jar:3.0.14]
>
> at
> org.apache.cassandra.repair.RepairSession.validationComplete(RepairSession.java:178)
> ~[apache-cassandra-3.0.14.jar:3.0.14]
>
> at
> org.apache.cassandra.service.ActiveRepairService.handleMessage(ActiveRepairService.java:486)
> ~[apache-cassandra-3.0.14.jar:3.0.14]
>
> at
> org.apache.cassandra.repair.RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:164)
> ~[apache-cassandra-3.0.14.jar:3.0.14]
>
> at
> org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:67)
> ~[apache-cassandra-3.0.14.jar:3.0.14]
>
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_102]
>
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ~[na:1.8.0_102]
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [na:1.8.0_102]
>
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_102]
>
> at
> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:79)
> [apache-cassandra-3.0.14.jar:3.0.14]
>
> at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_102]
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Regular dropped READ messages

2017-06-06 Thread Alexander Dejanovski
table WHERE app = ? AND platform = ? AND slug = ? AND partition =
> ? AND user_id >= ? LIMIT ?
>
> partition is basically an integer that goes from 0 to 15, and we always
> select the 16 partitions in parallel.
>
> Note that we write constantly to this table, to update some fields, insert
> the user into the new "slug" (a slug is an amalgamation of different
> parameters like state, timezone etc that allows us to efficiently query
> all users from a particular "app" with a given "slug". At least that's the
> idea, as seen here it causes us some trouble).
>
> And yes, we do use batches to write this data, this is how we process each
> user update:
>   - SELECT from a "master" slug to get the fields we need
>   - from that, compute a list of slugs the user had and a list of slugs
> the user should have (for example if he changes timezone we have to update
> the slug)
>   - delete the user from the slug he shouldn't be in and insert the user
> where he should be.
> The last part, delete/insert is done in a logged batch.
>
> I hope it's relatively clear.
>
> On Tue, Jun 6, 2017, at 02:46 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> dropped messages are indeed common in case of long GC pauses.
> Having 4s to 6s pauses is not normal and is the sign of an unhealthy
> cluster. Minor GCs are usually faster but you can have long ones too.
>
> If you can share your hardware specs along with your current GC settings
> (CMS or G1, heap size, young gen size) and a distribution of GC pauses
> (rate of minor GCs, average and max duration of GCs) we could try to help
> you tune your heap settings.
> You can activate full GC logging which could help in fine tuning
> MaxTenuringThreshold and survivor space sizing.
>
> You should also check for max partition sizes and number of SSTables
> accessed per read. Run nodetool cfstats/cfhistograms on your tables to get
> both. p75 should be less or equal to 4 in number of SSTables  and you
> shouldn't have partitions over... let's say 300 MBs. Partitions > 1GB are a
> critical problem to address.
>
> Other things to consider are :
> Do you read from a single partition for each query ?
> Do you use collections that could spread over many SSTables ?
> Do you use batches for writes (although your problem doesn't seem to be
> write related) ?
> Can you share the queries from your scheduled selects and the data model ?
>
> Cheers,
>
>
> On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <m...@vrischmann.me> wrote:
>
>
> Hi,
>
> we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly
> get READ messages dropped:
>
> > READ messages were dropped in last 5000 ms: 974 for internal timeout and
> 0 for cross node timeout
>
> Looking at the logs, some are logged at the same time as Old Gen GCs.
> These GCs all take around 4 to 6s to run. To me, it's "normal" that these
> could cause reads to be dropped.
> However, we also have reads dropped without Old Gen GCs occurring, only
> Young Gen.
>
> I'm wondering if anyone has a good way of determining what the _root_
> cause could be. Up until now, the only way we managed to decrease load on
> our cluster was by guessing some stuff, trying it out and being lucky,
> essentially. I'd love a way to make sure what the problem is before
> tackling it. Doing schema changes is not a problem, but changing stuff
> blindly is not super efficient :)
>
> What I do see in the logs, is that these happen almost exclusively when we
> do a lot of SELECT. The time logged almost always corresponds to times
> when our scheduled SELECTs are happening. That narrows the scope a little,
> but still.
>
> Anyway, I'd appreciate any information about troubleshooting this scenario.
> Thanks.
>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Regular dropped READ messages

2017-06-06 Thread Alexander Dejanovski
Hi Vincent,

dropped messages are indeed common in case of long GC pauses.
Having 4s to 6s pauses is not normal and is the sign of an unhealthy
cluster. Minor GCs are usually faster but you can have long ones too.

If you can share your hardware specs along with your current GC settings
(CMS or G1, heap size, young gen size) and a distribution of GC pauses
(rate of minor GCs, average and max duration of GCs) we could try to help
you tune your heap settings.
You can activate full GC logging which could help in fine tuning
MaxTenuringThreshold and survivor space sizing.

You should also check for max partition sizes and number of SSTables
accessed per read. Run nodetool cfstats/cfhistograms on your tables to get
both. p75 should be less or equal to 4 in number of SSTables  and you
shouldn't have partitions over... let's say 300 MBs. Partitions > 1GB are a
critical problem to address.
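
For example, a minimal sketch of those two checks (keyspace and table names
are placeholders):

nodetool cfhistograms my_keyspace my_table   # SSTables-per-read and partition size percentiles
nodetool cfstats my_keyspace.my_table | grep -E 'Compacted partition maximum bytes|Maximum live cells'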

Other things to consider are :
Do you read from a single partition for each query ?
Do you use collections that could spread over many SSTables ?
Do you use batches for writes (although your problem doesn't seem to be
write related) ?
Can you share the queries from your scheduled selects and the data model ?

Cheers,


On Tue, Jun 6, 2017 at 2:33 PM Vincent Rischmann <m...@vrischmann.me> wrote:

> Hi,
>
> we have a cluster of 11 nodes running Cassandra 2.2.9 where we regularly
> get READ messages dropped:
>
> > READ messages were dropped in last 5000 ms: 974 for internal timeout and
> 0 for cross node timeout
>
> Looking at the logs, some are logged at the same time as Old Gen GCs.
> These GCs all take around 4 to 6s to run. To me, it's "normal" that these
> could cause reads to be dropped.
> However, we also have reads dropped without Old Gen GCs occurring, only
> Young Gen.
>
> I'm wondering if anyone has a good way of determining what the _root_
> cause could be. Up until now, the only way we managed to decrease load on
> our cluster was by guessing some stuff, trying it out and being lucky,
> essentially. I'd love a way to make sure what the problem is before
> tackling it. Doing schema changes is not a problem, but changing stuff
> blindly is not super efficient :)
>
> What I do see in the logs, is that these happen almost exclusively when we
> do a lot of SELECT. The time logged almost always corresponds to times
> when our scheduled SELECTs are happening. That narrows the scope a little,
> but still.
>
> Anyway, I'd appreciate any information about troubleshooting this scenario.
> Thanks.
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: AWS Cassandra backup/Restore tools

2017-05-12 Thread Alexander Dejanovski
Hi,

here are the main techniques that I know of to perform backups for
Cassandra :

   - Tablesnap (https://github.com/JeremyGrosser/tablesnap) : performs
   continuous backups on S3. Comes with tableslurp to restore backups (one
   table at a time only) and tablechop to delete outdated sstables from S3.
   - incremental backup : activate it in the cassandra.yaml file and it
   will create snapshots for all newly flushed SSTables. It's up to you to
   move the snapshots off-node and delete them. I don't really like that
   technique since it creates a lot of small sstables that eventually contain
   a lot of outdated data. Upon restore you'll have to wait until compaction
   catches up on compacting all the history (which could take a while and use
   a lot of power). Your backups could also grow indefinitely with this
   technique since there's no compaction, so no purge. You'll have to build
   the restore script/procedure.
   - scheduled snapshots : you perform full snapshots by yourself and move
   them off node. You'll have to build the restore script/procedure.
   - EBS snapshots : probably the easiest way to perform backups if you are
   using M4/R4 instances on AWS.
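
As an illustration of the "scheduled snapshots" option above, a minimal sketch
for a single node (keyspace, data path and S3 bucket are made up; pruning old
backups and the restore side are left out):

TAG=backup_$(date +%Y%m%d)
nodetool snapshot -t "$TAG" my_keyspace
aws s3 sync /var/lib/cassandra/data/my_keyspace "s3://my-backup-bucket/$(hostname)/my_keyspace" \
  --exclude '*' --include "*/snapshots/$TAG/*"
nodetool clearsnapshot -t "$TAG" my_keyspace   # once the upload has been verified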


Cheers,

On Thu, May 11, 2017 at 11:01 PM Manikandan Srinivasan <
msriniva...@datastax.com> wrote:

> Blake is correct. OpsCenter 6.0 and up doesn't work with OSS C*. @Nitan:
> We have made some substantial changes to the Opscenter 6.1 backup service,
> specifically when it comes to S3 backups. Having said this, I am not going
> to be sale-sy here. If folks need some help or need more clarity to know
> more about these improvements, please send me an email directly:
> msriniva...@datastax.com
>
> Regards
> Mani
>
> On Thu, May 11, 2017 at 1:54 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>
>> Also , Opscenter backup/restore does not work for large databases
>>
>> Sent from my iPhone
>>
>> On May 11, 2017, at 3:41 PM, Blake Eggleston <beggles...@apple.com>
>> wrote:
>>
>> OpsCenter 6.0 and up don't work with Cassandra.
>>
>> On May 11, 2017 at 12:31:08 PM, cass savy (casss...@gmail.com) wrote:
>>
>> AWS Backup/Restore process/tools for C*/DSE C*:
>>
>> Has anyone used Opscenter 6.1 backup tool to backup/restore data for
>> larger datasets online ?
>>
>> If yes, did you run into issues using that tool to backup/restore data in
>> PROD that caused any performance or any other impact to the cluster?
>>
>> If no, what are other tools that people have used or recommended for
>> backup and restore of Cassandra keyspaces?
>>
>> Please advice.
>>
>>
>>
>
>
> --
> Regards,
>
> Manikandan Srinivasan
>
> Director, Product Management| +1.408.887.3686 |
> manikandan.sriniva...@datastax.com
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Cassandra 3.10 has partial partition key search but does it result in a table scan?

2017-05-09 Thread Alexander Dejanovski
Hi Kant,

Unless you provide the full partition key, I see no way for Cassandra to
avoid doing a full table scan.
In order to know on which specific nodes to search (and in which sstables,
etc.) it needs to have a token. The token is a hash of the whole
partition key.
For a specific value of column "a" and different values of column "b" you
always end up with different tokens, which have no guarantee of being stored
on the same node.
After that, bloom filters, partition indexes, etc. also require the full
token, so a full scan is still necessary on each node to get the data.

TL;DR : no way to avoid a full cluster scan unless you provide the full
partition key in your where clause.
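
For illustration, against the "hello" table quoted below (values are made
up), the difference looks like this:

   -- full partition key (a, b): the token can be computed and the read is
   -- routed to the replicas that own it
   SELECT * FROM hello WHERE a = 'foo' AND b = 1;

   -- partial partition key: no token can be computed, so this requires
   -- ALLOW FILTERING and a scan across the whole cluster
   SELECT * FROM hello WHERE a = 'foo' ALLOW FILTERING;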

Cheers,

On Tue, May 9, 2017 at 4:24 PM Jon Haddad <jonathan.had...@gmail.com> wrote:

> Nope, I didn’t comment on that query.   I specifically answered your
> question about "select * from hello where a='foo' allow filtering;”
>
> The query you’ve listed here looks like it would also do a full table scan
> (again, I don’t see how it would be avoided).
>
> I recommend firing up a 3 node cluster using CCM, creating a key space
> with RF=1, and seeing what it does.
>
> On May 9, 2017, at 9:12 AM, Kant Kodali <k...@peernova.com> wrote:
>
> Hi,
>
> Are you saying The following query select max(b) from hello where a='a1'
> allow filtering; doesn't result in a table scan? I got the result for
> this query and yes I just tried tracing it and looks like it is indeed
> doing a table scan on ReadStage-2 although I am not sure if I am
> interpreting it right? Finally is there anyway to prevent table scan while
> providing the partial partition key and get the max b ?
>
> 
>
>
> On Tue, May 9, 2017 at 6:33 AM, Jon Haddad <jonathan.had...@gmail.com>
> wrote:
>
>> I don’t see any way it wouldn’t.  Have you tried tracing it?
>>
>> > On May 9, 2017, at 8:32 AM, Kant Kodali <k...@peernova.com> wrote:
>> >
>> > Hi All,
>> >
>> > It looks like Cassandra 3.10 has partial partition key search but does
>> it result in a table scan? for example I can have the following
>> >
>> > create table hello(
>> > a text,
>> > b int,
>> > c text,
>> > d text,
>> > primary key((a,b), c)
>> > );
>> >
>> > Now I can do select * from hello where a='foo' allow filtering;// This
>> works in 3.10 but I wonder if this query results in table scan and if so is
>> there any way to limit such that I get max b?
>> >
>> > Thanks!
>>
>>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: manual deletes with TWCS

2017-05-06 Thread Alexander Dejanovski
Hi John,

if all your data is TTLed then you'll be fine and purge should happen in
due time, as long as your sstables don't overlap on timestamp (which can
only happen through repair mechanisms).
The tombstones will get purged when the sstables that contain them also
fully expire.
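
For reference, a minimal sketch of that kind of table (names and values are
illustrative, not taken from your schema): a 1-day window with a 7-day
default TTL, so each SSTable can be dropped as a whole once its window has
fully expired:

   CREATE TABLE metrics (
       source text,
       day    date,
       ts     timestamp,
       value  double,
       PRIMARY KEY ((source, day), ts)
   ) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                        'compaction_window_unit': 'DAYS',
                        'compaction_window_size': 1}
     AND default_time_to_live = 604800;  -- 7 days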

Cheers

On Fri, May 5, 2017 at 11:04 PM, John Sanda <john.sa...@gmail.com> wrote:

> This is involving TTLed data, and I actually would want to delete all
> related partitions across all time windows. Let's say I have a time series
> partitioned by day with a 7 day TTL and a window size of one day. If I
> delete partitions for the past seven days, would I still run into the issue
> of data purge being postponed?
>
> On Fri, May 5, 2017 at 4:57 PM, Jon Haddad <jonathan.had...@gmail.com>
> wrote:
>
>> You cannot.
>>
>> From Alex’s TLP post:
>> http://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>>
>> TWCS is no fit for workload that perform deletes on non TTLed data.
>> Consider that SSTables from different time windows will never be compacted
>> together, so data inserted on day 1 and deleted on day 2 will have the
>> tombstone and the shadowed cells living in different time windows. Unless a
>> major compaction is performed (which shouldn’t), and while the deletion
>> will seem effective when running queries, space will never be reclaimed on
>> disk.
>> Deletes can be performed on TTLed data if needed, but the partition will
>> then exist in different time windows, which will postpone actual deletion
>> from disk until both time windows fully expire.
>>
>>
>> On May 5, 2017, at 1:54 PM, John Sanda <john.sa...@gmail.com> wrote:
>>
>> How problematic is it to perform deletes when using TWCS? I am currently
>> using TWCS and have some new use cases for performing deletes. So far I
>> have avoided performing deletes, but I am wondering what issues I might run
>> into.
>>
>>
>> - John
>>
>>
>>
>
>
> --
>
> - John
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: cassandra OOM

2017-04-03 Thread Alexander Dejanovski
Hi,

we've seen G1GC going OOM on production clusters (repeatedly) with a 16GB
heap when the workload is intense, and given you're running on m4.2xl I
wouldn't go over 16GB for the heap.

I'd suggest reverting to CMS, using a 16GB heap and up to 6GB of new
gen. You can use 5 as an initial MaxTenuringThreshold value and activate
GC logging to fine-tune the settings afterwards.

FYI CMS tends to perform better than G1 even though it's a little bit
harder to tune.
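
For reference, a rough sketch of what that could look like in jvm.options
(with the G1 section commented out); these are starting values to adjust
using the GC logs, not tuned recommendations:

   -Xms16G
   -Xmx16G
   -Xmn6G
   -XX:+UseParNewGC
   -XX:+UseConcMarkSweepGC
   -XX:+CMSParallelRemarkEnabled
   -XX:SurvivorRatio=8
   -XX:MaxTenuringThreshold=5
   -XX:CMSInitiatingOccupancyFraction=75
   -XX:+UseCMSInitiatingOccupancyOnly
   # GC logging, to fine-tune afterwards
   -XX:+PrintGCDetails
   -XX:+PrintGCDateStamps
   -Xloggc:/var/log/cassandra/gc.log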

Cheers,

On Mon, Apr 3, 2017 at 10:54 PM Gopal, Dhruva <dhruva.go...@aspect.com>
wrote:

> 16 Gig heap, with G1. Pertinent info from jvm.options below (we’re using
> m2.2xlarge instances in AWS):
>
>
>
>
>
> #
>
> # HEAP SETTINGS #
>
> #
>
>
>
> # Heap size is automatically calculated by cassandra-env based on this
>
> # formula: max(min(1/2 ram, 1024MB), min(1/4 ram, 8GB))
>
> # That is:
>
> # - calculate 1/2 ram and cap to 1024MB
>
> # - calculate 1/4 ram and cap to 8192MB
>
> # - pick the max
>
> #
>
> # For production use you may wish to adjust this for your environment.
>
> # If that's the case, uncomment the -Xmx and Xms options below to override
> the
>
> # automatic calculation of JVM heap memory.
>
> #
>
> # It is recommended to set min (-Xms) and max (-Xmx) heap sizes to
>
> # the same value to avoid stop-the-world GC pauses during resize, and
>
> # so that we can lock the heap in memory on startup to prevent any
>
> # of it from being swapped out.
>
> -Xms16G
>
> -Xmx16G
>
>
>
> # Young generation size is automatically calculated by cassandra-env
>
> # based on this formula: min(100 * num_cores, 1/4 * heap size)
>
> #
>
> # The main trade-off for the young generation is that the larger it
>
> # is, the longer GC pause times will be. The shorter it is, the more
>
> # expensive GC will be (usually).
>
> #
>
> # It is not recommended to set the young generation size if using the
>
> # G1 GC, since that will override the target pause-time goal.
>
> # More info:
> http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html
>
> #
>
> # The example below assumes a modern 8-core+ machine for decent
>
> # times. If in doubt, and if you do not particularly want to tweak, go
>
> # 100 MB per physical CPU core.
>
> #-Xmn800M
>
>
>
> #
>
> #  GC SETTINGS  #
>
> #
>
>
>
> ### CMS Settings
>
>
>
> #-XX:+UseParNewGC
>
> #-XX:+UseConcMarkSweepGC
>
> #-XX:+CMSParallelRemarkEnabled
>
> #-XX:SurvivorRatio=8
>
> #-XX:MaxTenuringThreshold=1
>
> #-XX:CMSInitiatingOccupancyFraction=75
>
> #-XX:+UseCMSInitiatingOccupancyOnly
>
> #-XX:CMSWaitDuration=1
>
> #-XX:+CMSParallelInitialMarkEnabled
>
> #-XX:+CMSEdenChunksRecordAlways
>
> # some JVMs will fill up their heap when accessed via JMX, see
> CASSANDRA-6541
>
> #-XX:+CMSClassUnloadingEnabled
>
>
>
> ### G1 Settings (experimental, comment previous section and uncomment
> section below to enable)
>
>
>
> ## Use the Hotspot garbage-first collector.
>
> -XX:+UseG1GC
>
> #
>
> ## Have the JVM do less remembered set work during STW, instead
>
> ## preferring concurrent GC. Reduces p99.9 latency.
>
> -XX:G1RSetUpdatingPauseTimePercent=5
>
> #
>
> ## Main G1GC tunable: lowering the pause target will lower throughput and
> vise versa.
>
> ## 200ms is the JVM default and lowest viable setting
>
> ## 1000ms increases throughput. Keep it smaller than the timeouts in
> cassandra.yaml.
>
> -XX:MaxGCPauseMillis=500
>
>
>
> ## Optional G1 Settings
>
>
>
> # Save CPU time on large (>= 16GB) heaps by delaying region scanning
>
> # until the heap is 70% full. The default in Hotspot 8u40 is 40%.
>
> -XX:InitiatingHeapOccupancyPercent=70
>
>
>
> # For systems with > 8 cores, the default ParallelGCThreads is 5/8 the
> number of logical cores.
>
> # Otherwise equal to the number of cores when 8 or less.
>
> # Machines with > 10 cores should try setting these to <= full cores.
>
> #-XX:ParallelGCThreads=16
>
> # By default, ConcGCThreads is 1/4 of ParallelGCThreads.
>
> # Setting both to the same value can reduce STW durations.
>
> #-XX:ConcGCThreads=16
>
>
>
> ### GC logging options -- uncomment to enable
>
>
>
> #-XX:+PrintGCDetails
>
> #-XX:+PrintGCDateStamps
>
> #-XX:+PrintHeapAtGC
>
> #-XX:+PrintTenuringDistribution
>
> #-XX:+PrintGCApplicationStoppedTime
>
> #-XX:+PrintPromotionFailure
>
> #-XX:PrintFLSStatistics=1
>
> #-Xloggc

Re: cassandra OOM

2017-04-03 Thread Alexander Dejanovski
Hi,

could you share your GC settings ? G1 or CMS ? Heap size, etc...

Thanks,

On Sun, Apr 2, 2017 at 10:30 PM Gopal, Dhruva <dhruva.go...@aspect.com>
wrote:

> Hi –
>
>   We’ve had what looks like an OOM situation with Cassandra (we have a
> dump file that got generated) in our staging (performance/load testing
> environment) and I wanted to reach out to this user group to see if you had
> any recommendations on how we should approach our investigation as to the
> cause of this issue. The logs don’t seem to point to any obvious issues,
> and we’re no experts in analyzing this by any means, so was looking for
> guidance on how to proceed. Should we enter a Jira as well? We’re on
> Cassandra 3.9, and are running  a six node cluster. This happened in a
> controlled load testing environment. Feedback will be much appreciated!
>
>
>
>
>
> Regards,
>
> Dhruva
>
>
> This email (including any attachments) is proprietary to Aspect Software,
> Inc. and may contain information that is confidential. If you have received
> this message in error, please do not read, copy or forward this message.
> Please notify the sender immediately, delete it from your system and
> destroy any copies. You may not further disclose or distribute this email
> or its attachments.
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: How can I scale my read rate?

2017-03-27 Thread Alexander Dejanovski
By default the TokenAwarePolicy does shuffle replicas, and it can be
disabled if you want to only hit the primary replica for the token range
you're querying :
http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html
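
For illustration, a minimal Java driver 3.0.x sketch (the contact point and
child policy are placeholders) with shuffling disabled, so reads go to the
primary replica for the token first:

   import com.datastax.driver.core.Cluster;
   import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
   import com.datastax.driver.core.policies.TokenAwarePolicy;

   public class PrimaryReplicaReads {
       public static void main(String[] args) {
           Cluster cluster = Cluster.builder()
               .addContactPoint("127.0.0.1")
               .withLoadBalancingPolicy(
                   // the second argument disables replica shuffling
                   new TokenAwarePolicy(
                       DCAwareRoundRobinPolicy.builder().build(), false))
               .build();
           // ... cluster.connect() and run queries as usual
       }
   }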

On Mon, Mar 27, 2017 at 9:41 AM Avi Kivity <a...@scylladb.com> wrote:

> Is the driver doing the right thing by directing all reads for a given
> token to the same node?  If that node fails, then all of those reads will
> be directed at other nodes, all of whom will be cache-cold for the
> failed node's primary token range.  Seems like the driver should distribute
> reads among all the replicas for a token, at least as an option, to
> keep the caches warm for latency-sensitive loads.
>
> On 03/26/2017 07:46 PM, Eric Stevens wrote:
>
> Yes, throughput for a given partition key cannot be improved with
> horizontal scaling.  You can increase RF to theoretically improve
> throughput on that key, but actually in this case smart clients might hold
> you back, because they're probably token aware, and will try to serve that
> read off the key's primary replica, so all reads would be directed at a
> single node for that key.
>
> If you're reading at CL=QUORUM, there's a chance that increasing RF will
> actually reduce performance rather than improve it, because you've
> increased the total amount of work to serve the read (as well as the
> write).  If you're reading at CL=ONE, increasing RF will increase the
> chances of falling afoul of eventual consistency.
>
> However that's not really a real-world scenario.  Or if it is, Cassandra
> is probably the wrong tool to satisfy that kind of workload.
>
> On Thu, Mar 23, 2017 at 11:43 PM Alain Rastoul <alf.mmm@gmail.com>
> wrote:
>
> On 24/03/2017 01:00, Eric Stevens wrote:
> > Assuming an even distribution of data in your cluster, and an even
> > distribution across those keys by your readers, you would not need to
> > increase RF with cluster size to increase read performance.  If you have
> > 3 nodes with RF=3, and do 3 million reads, with good distribution, each
> > node has served 1 million read requests.  If you increase to 6 nodes and
> > keep RF=3, then each node now owns half as much data and serves only
> > 500,000 reads.  Or more meaningfully in the same time it takes to do 3
> > million reads under the 3 node cluster you ought to be able to do 6
> > million reads under the 6 node cluster since each node is just
> > responsible for 1 million total reads.
> >
> Hi Eric,
>
> I think I got your point.
> In case of really evenly distributed  reads it may (or should?) not make
> any difference,
>
> But when you do not distribute well the reads (and in that case only),
> my understanding about RF was that it could help spreading the load :
> In that case, with RF= 4 instead of 3,  with several clients accessing keys
> same key ranges, a coordinator could pick up one node to handle the request
> in 4 replicas instead of picking up one node in 3 , thus having
> more "workers" to handle a request ?
>
> Am I wrong here ?
>
> Thank you for the clarification
>
>
> --
> best,
> Alain
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Help with data modelling (from MySQL to Cassandra)

2017-03-27 Thread Alexander Dejanovski
Hi Zoltan,

you must try to avoid multi partition queries as much as possible. Instead,
use asynchronous queries to grab several partitions concurrently.
Try to send no more than ~100 queries at the same time to avoid DDOS-ing
your cluster.
This would leave you with roughly 1000+ groups of async queries to run.
Performance will really depend on your hardware, consistency level, load
balancing policy, partition fragmentation (how many updates you'll run on
each element over time) and the SLA you're expecting.
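
To make that concrete, here is a rough sketch (DataStax Java driver 3.x; the
session, table name and the 100-permit limit are just the assumptions from
above) of throttling in-flight async reads with a semaphore:

   import java.util.ArrayList;
   import java.util.List;
   import java.util.concurrent.Semaphore;
   import com.datastax.driver.core.PreparedStatement;
   import com.datastax.driver.core.ResultSetFuture;
   import com.datastax.driver.core.Session;

   public class ThrottledReads {
       public static List<ResultSetFuture> fetch(Session session, List<Long> ids)
               throws InterruptedException {
           Semaphore inFlight = new Semaphore(100);  // ~100 concurrent queries
           PreparedStatement ps =
               session.prepare("SELECT * FROM elements WHERE element_id = ?");
           List<ResultSetFuture> futures = new ArrayList<>();
           for (Long id : ids) {
               inFlight.acquire();                   // blocks once 100 are in flight
               ResultSetFuture f = session.executeAsync(ps.bind(id));
               f.addListener(inFlight::release, Runnable::run);
               futures.add(f);
           }
           return futures;
       }
   }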

If that approach doesn't meet your SLA requirements, you can try to use
wide partitions and group elements under buckets :

CREATE TABLE elements (
doc_id bigint,
bucket bigint,
element_id bigint,
element_content text,
PRIMARY KEY((doc_id, bucket), element_id)
)

The bucket here could be a modulus of the element_id (or of the hash of
element_id if it is not a numerical value). This way you can spread
elements over the cluster and access them directly if you have the doc_id
and the element_id to perform updates.
You'll get to run fewer queries concurrently, but they'll take more time
than the individual ones in the first scenario (1 partition per element).
You should benchmark both solutions to see which one gives the best
performance.
Bucket your elements so that your partitions don't grow over 100MB. Large
partitions are silent cluster killers (1GB+ partitions are a direct threat
to cluster stability)...
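
As an illustration of the bucketed layout above (assuming, say, 16 buckets
computed client-side as element_id % 16), the access patterns would look
like:

   -- update one element directly, given its doc_id and element_id
   -- (103 % 16 = 7, so bucket = 7)
   UPDATE elements SET element_content = '...'
     WHERE doc_id = 42 AND bucket = 7 AND element_id = 103;

   -- read one bucket partition; to load the whole document, issue the
   -- query for each of the 16 buckets concurrently
   SELECT element_id, element_content FROM elements
     WHERE doc_id = 42 AND bucket = 7;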

To ensure best performance, use prepared statements along with the
TokenAwarePolicy
<http://docs.datastax.com/en/drivers/java/3.0/com/datastax/driver/core/policies/TokenAwarePolicy.html>
to
avoid unnecessary coordination.

Cheers,


On Mon, Mar 27, 2017 at 4:40 AM Zoltan Lorincz <zol...@gmail.com> wrote:

> Querying by (doc_id and element_id ) OR just by (element_id) is fine, but
> the real question is, will it be efficient to query 100k+ primary keys in
> the elements table?
> e.g.
>
> SELECT * FROM elements WHERE element_id IN (element_id1, element_id2,
> element_id3,  element_id100K+)  ?
>
> The element_id is a primary key.
>
> Thank you!
>
>
> On Sun, Mar 26, 2017 at 11:35 PM, Matija Gobec <matija0...@gmail.com>
> wrote:
>
> Have one table hold document metadata (doc_id, title, description, ...)
> and have another table elements where partition key is doc_id and
> clustering key is element_id.
> Only problem here is if you need to query and/or update element just by
> element_id but I don't know your queries up front.
>
> On Sun, Mar 26, 2017 at 10:16 PM, Zoltan Lorincz <zol...@gmail.com> wrote:
>
> Dear cassandra users,
>
> We have the following structure in MySql:
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(index), title, description]
>
> Notation: table name->[column1(key or index), column2, …]
>
> We want to transfer the data to Cassandra.
>
> Each document can contain a large number of elements (between 1 and 100k+)
>
> We have two requirements:
> a) Load all elements for a given doc_id quickly
> b) Update the value of one individual element quickly
>
>
> We were thinking on the following cassandra configurations:
>
> Option A
>
> documents->[doc_id(primary key), title, description, elements] (elements
> could be a SET or a TEXT, each time new elements are added (they are never
> removed) we would append it to this column)
> elements->[element_id(primary key), title, description]
>
> Loading a document:
>
>  a) Load document with given  and get all element ids
> SELECT * from documents where doc_id=‘id’
>
>  b) Load all elements with the given ids
> SELECT * FROM elements where element_id IN (ids loaded from query a)
>
>
> Option B
>
> documents->[doc_id(primary key), title, description]
> elements->[element_id(primary key), doc_id(secondary index), title,
> description]
>
> Loading a document:
>  a) SELECT * from elements where doc_id=‘id’
>
>
> Neither solution seems to be good: in Option A, even if we are
> querying by primary keys, the second query will have 100k+ primary key ids
> in the WHERE clause, and the second solution looks like an anti-pattern in
> cassandra.
>
> Could anyone give any advice how would we create a model for our use case?
>
> Thank you in advance,
> Zoltan.
>
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Disk full during new node bootstrap

2017-02-04 Thread Alexander Dejanovski
Hi,

could you share with us the following informations ?

- "nodetool status" output
- Keyspace definitions (we need to check the replication strategy you're
using on all keyspaces)
- Specifics about what you're calling "groups" in a DC. Are these racks ?

Thanks

On Sat, Feb 4, 2017 at 10:41 AM laxmikanth sadula <laxmikanth...@gmail.com>
wrote:

> Yes .. same number of tokens...
> 256
>
> On Sat, Feb 4, 2017 at 11:56 AM, Jonathan Haddad <j...@jonhaddad.com>
> wrote:
>
> Are you using the same number of tokens on the new node as the old ones?
>
> On Fri, Feb 3, 2017 at 8:31 PM techpyaasa . <techpya...@gmail.com> wrote:
>
> Hi,
>
> We are using c* 2.0.17 , 2 DCs , RF=3.
>
> When I tried to add a new node to one group in a DC, I got disk full. Can
> someone please tell me what is the best way to resolve this?
>
> Run compaction for nodes in that group (to which I'm going to add the new
> node, as data streams to the new node from the nodes of the group it is
> added to)
>
> OR
>
> Bootstrap/add 2 (multiple) nodes at a time?
>
>
> Please suggest better way to fix this.
>
> Thanks in advance
>
> Techpyaasa
>
>
>
>
> --
> Regards,
> Laxmikanth
> 99621 38051
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Why compacting process uses more data that is expected

2017-01-04 Thread Alexander Dejanovski
Indeed, nodetool compactionstats shows uncompressed sizes.
As Oleksandr suggests, use the table compression ratio to compute the
actual size on disk.
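
For example, assuming (purely for illustration) a compression ratio of about
0.25 for that table, 138.66 GB of uncompressed data works out to roughly
138.66 x 0.25 ≈ 34.7 GB on disk, which lines up with the ~34 GB observed in
the sstables.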

It would actually be a great improvement for ops if we could add a switch
to compactionstats in order to have the compression ratio applied
automatically.

On Thu, Jan 5, 2017 at 7:22 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Jan 4, 2017 17:58, "Jean Carlo" <jean.jeancar...@gmail.com> wrote:
>
> Hello guys
>
> I have a table with 34Gb of data in sstables (including tmp). And I can
> see cassandra is doing some compactions on it. What surprissed me is that
> nodetool compactionstats says he is compacting  138.66GB
>
>
> root@node001 /root # nodetool compactionstats -H
> pending tasks: 103
> *   compaction typekeyspace  table
> completed   total unit   progress*
> Compaction keyspace1   table_02   112.74 GB
> 138.66 GB   bytes 81.31%
> Active compaction remaining time :   0h03m27s
>
> So My question is, from where those 138.66GB come if my table has only
> 34GB of data.
>
>
> Hello,
>
> I believe that output of compactionstats shows you the size of
> *uncompressed* data. Can you check (with nodetool tablestats) your
> compression ratio?
>
> --
> Alex
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Reaper repair seems to "hang"

2017-01-04 Thread Alexander Dejanovski
Actually, the problem is related to CASSANDRA-11430
<https://issues.apache.org/jira/browse/CASSANDRA-11430>.

Before 2.2.6, the notification service did not work with newly deprecated
repair methods, on which Reaper still currently relies.
C* 2.2.6 and onwards are not affected by this problem and work fine with
Reaper.

We're working on switching to the new repair method for 2.2 and 3.0/3.x,
which should be ready in a few days/weeks.

When using incremental repair, watch out for CASSANDRA-11696 which was
fixed in C* 2.1.15, 2.2.7, 3.0.8 and 3.8. In prior versions, unrepaired
SSTables can be marked as repaired, and thus never be repaired.

Cheers,



On Wed, Jan 4, 2017 at 6:09 AM Bhuvan Rawal <bhu1ra...@gmail.com> wrote:

> Hi Daniel,
>
> Looks like yours is a different case. If you're running incremental repair
> for the first time it may take a long time, especially if the table is
> large. And repair may seem to be stuck even when things are working.
>
> You can try nodetool compactionstats when repair appears stuck, you'll
> find a validation compaction happening if that's indeed the case.
>
> For the first incremental repair you can follow this doc, in further
> repairs incremental repair should encounter very few sstables:
>
> https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>
> Regards,
> Bhuvan
>
>
>
> On Jan 4, 2017 3:52 AM, "Daniel Kleviansky" <dan...@kleviansky.com> wrote:
>
> Hi Bhuvan,
>
> Thank you so very much for your detailed reply.
> Just to ensure everyone is across the same information, and responses are
> not duplicated across two different forums, I thought I'd share with the
> mailing list that I've created a GitHub issue at:
> https://github.com/thelastpickle/cassandra-reaper/issues/39
>
> Kind regards,
> Daniel
>
> On Wed, Jan 4, 2017 at 6:31 AM, Bhuvan Rawal <bhu1ra...@gmail.com> wrote:
>
> Hi Daniel,
>
> We faced a similar issue during repair with reaper. We ran repair with
> more repair threads than number of cassandra nodes. But on and off repair
> was getting stuck and we had to do rolling restart of cluster or wait for
> lock time to expire (~1hr).
>
> We had a look at the stuck repair, threadpools were getting stuck at
> AntiEntropy stage. From the synchronized block in the repair code it
> appeared that at most 1 concurrent repair session per node is possible.
>
> According to
> https://medium.com/@mlowicki/cassandra-reaper-introduction-ed73410492bf#.f0erygqpk
>  :
>
> Segment runner has protection mechanism to avoid overloading nodes using
> two simple rules to postpone repair if:
>
> 1. Number of pending compactions is greater than *MAX_PENDING_COMPACTIONS* (20
> by default)
> *2. Node is already running repair job*
>
> We tried running reaper with fewer threads than the number of nodes
> (assuming reaper would not submit multiple segments to a single cassandra
> node) but it was still observed that multiple repair segments were going to
> the same node concurrently, and therefore nodes could get stuck in that
> state. Finally we settled on a single repair thread in the reaper
> settings. Although it takes slightly more time, it has completed
> successfully numerous times.
>
> Thread Dump of cassandra server when repair was getting stuck:
>
> "*AntiEntropyStage:1" #159 daemon prio=5 os_prio=0 tid=0x7f0fa16226a0
> nid=0x3c82 waiting for monitor entry [0x7ee9eabaf000*]
>java.lang.Thread.State: BLOCKED (*on object monitor*)
> at
> org.apache.cassandra.service.ActiveRepairService.removeParentRepairSession(ActiveRepairService.java:392)
> - waiting to lock <0x00067c083308> (a
> org.apache.cassandra.service.ActiveRepairService)
> at
> org.apache.cassandra.service.ActiveRepairService.doAntiCompaction(ActiveRepairService.java:417)
> at org.apache.cassandra.repair
> .RepairMessageVerbHandler.doVerb(RepairMessageVerbHandler.java:145)
> at org.apache.cassandra.net
> .MessageDeliveryTask.run(MessageDeliveryTask.java:67)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> Hope it helps!
>
> Regards,
> Bhuvan
>

Re: Reaper repair seems to "hang"

2017-01-02 Thread Alexander Dejanovski
Hi Daniel,

could you file a bug in the issue tracker ?
https://github.com/thelastpickle/cassandra-reaper/issues

We'll figure out what's wrong and get your repairs running.

Thanks !

On Tue, Jan 3, 2017 at 12:35 AM Daniel Kleviansky <dan...@kleviansky.com>
wrote:

> Hi everyone,
>
> Using The Last Pickle's fork of Reaper, and unfortunately running into a
> bit of an issue. I'll try break it down below.
>
> # Problem Description:
> * After starting repair via the GUI, progress remains at 0/x.
> * Cassandra nodes calculate their respective token ranges, and then
> nothing happens.
> * There were no errors in the Reaper or Cassandra logs. Only a message of
> acknowledgement that a repair had initiated.
> * Performing stack trace on the running JVM, once can see that the thread
> spawning the repair process was waiting on a lock that was never being
> released.
> * This occurred on all nodes, and prevented any manually initiated repair
> process from running. A rolling restart of each node was required, after
> which one could run a `nodetool repair` successfully.
>
> # Cassandra Cluster Details:
> * Cassandra 2.2.5 running on Windows Server 2008 R2
> * 6 node cluster, split across 2 DCs, with RF = 3:3.
>
> # Reaper Details:
> * Reaper 0.3.3 running on Windows Server 2008 R2, utilising a PostgreSQL
> database.
>
> ## Reaper settings:
> * Parallism: DC-Aware
> * Repair Intensity: 0.9
> * Incremental: true
>
> Don't want to swamp you with more details or unnecessary logs, especially
> as I'd have to sanitize them before sending them out, so please let me know
> if there is anything else I can provide, and I'll do my best to get it to
> you.
>
> ​Kind regards,
> Daniel
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Exception thrown on repair after change was made to existing cassandra type

2016-12-06 Thread Alexander Dejanovski
Hi Robert,

Some of your nodes did not accept the new schema, or several alter commands
were issued on different nodes concurrently.
Repair cannot fix that as it works on data, not on schema.

To fix your issue, do a rolling restart of your whole cluster and all nodes
should get back in agreement on the schema.
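
Before and after the rolling restart you can check agreement with (a quick
sketch, run on any node):

   nodetool describecluster

All nodes should report the same schema version; as long as more than one
version is listed under "Schema versions", the cluster is still in
disagreement.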

Cheers,

On Tue, Dec 6, 2016 at 3:46 PM Robert Sicoie <robert.sic...@gmail.com>
wrote:

> Any suggestion is more than welcome.
> The fact is that the same alter commands worked in two other environments
> and now fails in the thirds environment. No downtime is accepted.
>
> Thanks,
> Robert
>
> Robert Sicoie
>
> On Tue, Dec 6, 2016 at 4:11 PM, Robert Sicoie <robert.sic...@gmail.com>
> wrote:
>
> Hi guys,
>
> I am running cassandra 3.0.5 on 5 nodes, NetworkTopologyStrategy, factor 2
> I had to add new 3 columns to an existing type and I've done it through
> cqlsh.
> After one of the alter commands I've got
>
> *OperationTimedOut: errors={'x.x.x.x': 'Request timed out while waiting
> for schema agreement. See Session.execute[_async](timeout) and
> Cluster.max_schema_agreement_wait.'}, last_host=x.x.x.x*
>
> I've checked with describe schema and the new fields were in place, so I
> suppose that cql tool timed out.
>
> After this I've noticed that the application (connecting through datastax
> java driver, quorum consistency level) started to throw
>
> *com.datastax.driver.core.exceptions.ReadFailureException: Cassandra
> failure during read query at consistency QUORUM (2 responses were required
> but only 0 replica responded, 1 failed)*
>
>
> I thought that this might be related to the schema changed and that repair
> would force schema propagation so I run nodetool repair on the box on which
> I trigged the schema alteration.
>
> Then I got
>
> *... Validation failed in /x.x.x.y (progress: 0%)*
>
> on this node and
>
> *ERROR [CompactionExecutor:37434] 2016-12-06 14:03:38,543
> CassandraDaemon.java:195 - Exception in thread
> Thread[CompactionExecutor:37434,1,main]*
> *java.lang.AssertionError: null*
> * at
> org.apache.cassandra.db.rows.ComplexColumnData$Builder.addCell(ComplexColumnData.java:246)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at
> org.apache.cassandra.db.rows.Row$Merger$ColumnDataReducer.getReduced(Row.java:613)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at
> org.apache.cassandra.db.rows.Row$Merger$ColumnDataReducer.getReduced(Row.java:539)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at
> org.apache.cassandra.utils.MergeIterator$ManyToOne.consume(MergeIterator.java:220)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at
> org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:159)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at org.apache.cassandra.db.rows.Row$Merger.merge(Row.java:516)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> **
> * at
> org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:263)
> ~[apache-cassandra-3.0.5.jar:3.0.5]*
> * at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> ~[na:1.8.0_60]*
> * at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> ~[na:1.8.0_60]*
> * at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> ~[na:1.8.0_60]*
> * at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [na:1.8.0_60]*
> * at java.lang.Thread.run(Thread.java:745) [na:1.8.0_60]*
>
> On the node /x.x.x.y
>
> Do you any suggestion?
> Thank you in advance,
> Robert
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: repair -pr in crontab

2016-11-24 Thread Alexander Dejanovski
Hi,

we maintain a hard fork of Reaper that works with all versions of Cassandra
up to 3.0.x : https://github.com/thelastpickle/cassandra-reaper
Just to save you some time digging into all the forks that could exist.

Cheers,

On Fri, Nov 25, 2016 at 7:37 AM Benjamin Roth <benjamin.r...@jaumo.com>
wrote:

> I recommend using cassandra-reaper
> Using crons without proper monitoring will most likely not work as
> expected.
> There are some reaper forks on GitHub. You have to check which one works
> with your Cassandra version. The original one from Spotify only works on
> 2.x not on 3.x
>
> On Nov 25, 2016 at 07:31, "wxn...@zjqunshuo.com" <wxn...@zjqunshuo.com> wrote:
>
> Hi Artur,
> When I asked similar questions, someone addressed me to the below links
> and they are helpful.
>
> See http://www.datastax.com/dev/blog/repair-in-cassandra
>
> https://lostechies.com/ryansvihla/2015/09/25/cassandras-repair-should-be-called-required-maintenance/
> https://cassandra-zone.com/understanding-repairs/
>
> Cheers,
> -Simon
>
> *From:* Artur Siekielski <a...@vhex.net>
> *Date:* 2016-11-10 04:22
> *To:* user <user@cassandra.apache.org>
> *Subject:* repair -pr in crontab
> Hi,
> the docs give me an impression that repairing should be run manually,
> and not put in crontab for default. Should each repair run be monitored
> manually?
>
> If I would like to put "repair -pr" in crontab for each node, with a few
> hour difference between the runs, are there any risks with such setup?
> Specifically:
> - if two or more "repair -pr" runs on different nodes are running at the
> same time, can it cause any problems besides high load?
> - can "repair -pr" be run simultaneously on all nodes at the same time?
> - I'm using the default gc_grace_period of 10 days. Are there any
> reasons to run repairing more often that once per 10 days, for a case
> when previous repairing fails?
> - how to monitor start and finish times of repairs, and if the runs were
> successful? Does the "nodetool repair" command is guaranteed to exit
> only after the repair is finished and does it return a status code to a
> shell?
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Out of memory and/or OOM kill on a cluster

2016-11-22 Thread Alexander Dejanovski
uested, slices=[-]
>
> So, we're guessing this is bad since it's warning us, however does this
> have a significant impact on the heap / GC ? I don't really know.
>
> - cfstats tells me this:
>
> Average live cells per slice (last five minutes): 1458.029594846951
> Maximum live cells per slice (last five minutes): 2001.0
> Average tombstones per slice (last five minutes): 1108.2466913854232
> Maximum tombstones per slice (last five minutes): 22602.0
>
> - regarding swap, it's not disabled anywhere, I must say we never really
> thought about it. Does it provide a significant benefit ?
>
> Thanks for your help, really appreciated !
>
> On Mon, Nov 21, 2016, at 04:13 PM, Alexander Dejanovski wrote:
>
> Vincent,
>
> only the 2.68GB partition is out of bounds here, all the others (<256MB)
> shouldn't be much of a problem.
> It could put pressure on your heap if it is often read and/or compacted.
> But to answer your question about the 1% harming the cluster, a few big
> partitions can definitely be a big problem depending on your access
> patterns.
> Which compaction strategy are you using on this table ?
>
> Could you provide/check the following things on a node that crashed
> recently :
>
>- Hardware specifications (how many cores ? how much RAM ? Bare metal
>or VMs ?)
>- Java version
>- GC pauses throughout a day (grep GCInspector
>/var/log/cassandra/system.log) : check if you have many pauses that take
>more than 1 second
>- GC logs at the time of a crash (if you don't produce any, you should
>activate them in cassandra-env.sh)
>- Tombstone warnings in the logs and high number of tombstone read in
>cfstats
>- Make sure swap is disabled
>
>
> Cheers,
>
>
> On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann <m...@vrischmann.me>
> wrote:
>
>
> @Vladimir
>
> We tried with 12Gb and 16Gb, the problem appeared eventually too.
> In this particular cluster we have 143 tables across 2 keyspaces.
>
> @Alexander
>
> We have one table with a max partition of 2.68GB, one of 256 MB, a bunch
> with the size varying between 10MB to 100MB ~. Then there's the rest with
> the max lower than 10MB.
>
> On the biggest, the 99% is around 60MB, 98% around 25MB, 95% around 5.5MB.
> On the one with max of 256MB, the 99% is around 4.6MB, 98% around 2MB.
>
> Could the 1% here really have that much impact ? We do write a lot to the
> biggest table and read quite often too, however I have no way to know if
> that big partition is ever read.
>
>
> On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> one of the usual causes of OOMs is very large partitions.
> Could you check your nodetool cfstats output in search of large partitions
> ? If you find one (or more), run nodetool cfhistograms on those tables to
> get a view of the partition sizes distribution.
>
> Thanks
>
> On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com>
> wrote:
>
>
> Did you try any value in the range 8-20 (e.g. 60-70% of physical memory).
> Also how many tables do you have across all keyspaces? Each table can
> consume minimum 1M of Java heap.
>
> Best regards, Vladimir Yudovin,
>
> *Winguzone <https://winguzone.com?from=list> - Hosted Cloud
> CassandraLaunch your cluster in minutes.*
>
>
>  On Mon, 21 Nov 2016 05:13:12 -0500*Vincent Rischmann
> <m...@vrischmann.me <m...@vrischmann.me>>* wrote 
>
> Hello,
>
> we have a 8 node Cassandra 2.1.15 cluster at work which is giving us a lot
> of trouble lately.
>
> The problem is simple: nodes regularly die because of an out of memory
> exception or the Linux OOM killer decides to kill the process.
> For a couple of weeks now we increased the heap to 20Gb hoping it would
> solve the out of memory errors, but in fact it didn't; instead of getting
> out of memory exception the OOM killer killed the JVM.
>
> We reduced the heap on some nodes to 8Gb to see if it would work better,
> but some nodes crashed again with out of memory exception.
>
> I suspect some of our tables are badly modelled, which would cause
> Cassandra to allocate a lot of data, however I don't know how to prove that
> and/or find which table is bad, and which query is responsible.
>
> I tried looking at metrics in JMX, and tried profiling using mission
> control but it didn't really help; it's possible I missed it because I have
> no idea what to look for exactly.
>
> Anyone have some advice for troubleshooting this ?
>
> Thanks.
>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
> -
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Cassandra reaper

2016-11-21 Thread Alexander Dejanovski
Hi Jai,

Reaper is fully open sourced and you should be able to add schedules.
Could you open an issue on GitHub and provide both configuration and error
output (if any) ? >>
https://github.com/thelastpickle/cassandra-reaper/issues


Thanks,

On Tue, Nov 22, 2016 at 1:59 AM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> I noticed that I am not able to add schedules, but I can run repairs.
>
> Is there some limitation on the opensource for adding the schedules?
>
> On Mon, Nov 21, 2016 at 4:25 PM, Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>
> Hello Alexander,
>
> Thanks for the help, I couldn't get around with my issue.
> but I started using : https://github.com/thelastpickle/cassandra-reaper
> it works like a charm :)
>
> I am using GUI, I just need to tweak/play with the configuration.
>
> Thanks again for the help
>
>
> On Tue, Nov 1, 2016 at 12:26 PM, Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>
> ok thank you,
> I will try and update you.
>
> On Tue, Nov 1, 2016 at 10:57 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> Running reaper with INFO level logging (that can be configured in the yaml
> file), you should have a console output telling you what's going on.
>
> If you started reaper with memory back end, restarting it will reset it
> and you'll have to register your cluster again, but if you used postgres it
> will resume tasks where they were left off.
>
> Please restart Reaper to at least have an output we can get information
> from, otherwise we're blind.
>
> Since you're using Cassandra 2.1, I'd advise switching to our fork since
> the original one is compiled against Cassandra 2.0 libraries. If you switch
> and use postgres, make sure you update the schema accordingly as we added
> fields for incremental repair support.
>
> Cheers,
>
> On Tue, Nov 1, 2016 at 6:31 PM, Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>
> Cassandra version is 2.1.16
>
> In my setup I don't see it writing to any logs
>
> On Tue, Nov 1, 2016 at 10:25 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> Do you have anything in the reaper logs that would show a failure of some
> sort ?
> Also, can you tell me which version of Cassandra you're using ?
>
> Thanks
>
> On Tue, Nov 1, 2016 at 6:15 PM Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>
> Thanks Alex,
>
> Forgot to mention but I did add the cluster. See the status below. It says
> the status is running but I don't see any repair happening. It has been in
> the same state for the past day.
> btw there is not much data in the cluster.
>
> [root@machine cassandra-reaper]#  ./bin/spreaper status-repair 3
> # Report improvements/bugs at
> https://github.com/spotify/cassandra-reaper/issues
> #
> --
> # Repair run with id '3':
> {
>   "cause": "manual spreaper run",
>   "cluster_name": "production",
>   "column_families": [],
>   "creation_time": "2016-11-01T00:39:15Z",
>   "duration": null,
>   "end_time": null,
>   "estimated_time_of_arrival": null,
>   "id": 3,
>   "intensity": 0.900,
>   "keyspace_name": "users",
> *  "last_event": "no events",*
>   "owner": "root",
>   "pause_time": null,
>   "repair_parallelism": "DATACENTER_AWARE",
>   "segments_repaired": 0,
>   "start_time": "2016-11-01T00:39:15Z",
> *  "state": "RUNNING",*
>   "total_segments": 301
> }
> [root@ machine cassandra-reaper]#
>
> On Tue, Nov 1, 2016 at 9:24 AM, Alexander Dejanovski <
> a...@thelastpickle.com> wrote:
>
> Hi,
>
> The first step in using reaper is to add a cluster to it, as it is a tool
> that can manage multiple clusters and does not need to be executed on a
> Cassandra node (you can run in on any edge node you want).
>
> You should run : ./bin/spreaper add-cluster 127.0.0.1
> Where you'll replace 127.0.0.1 by the address of one of the nodes of your
> cluster.
>
> Then you can run : ./bin/spreaper repair cluster_name keyspace_name
> to start repairing a keyspace.
>
> You might want to drop in the UI made by Stefan Podkowinski which might
> ease things up for you, at least at the beginning :
> https://github.com/spodkowinski/cassandra-reaper-ui
>
> Worth mentioning that at The Last Pickle we maintain a fork of Reaper that
> handles incremental repair

Re: Out of memory and/or OOM kill on a cluster

2016-11-21 Thread Alexander Dejanovski
Vincent,

only the 2.68GB partition is out of bounds here, all the others (<256MB)
shouldn't be much of a problem.
It could put pressure on your heap if it is often read and/or compacted.
But to answer your question about the 1% harming the cluster, a few big
partitions can definitely be a big problem depending on your access
patterns.
Which compaction strategy are you using on this table ?

Could you provide/check the following things on a node that crashed
recently :

   - Hardware specifications (how many cores ? how much RAM ? Bare metal or
   VMs ?)
   - Java version
   - GC pauses throughout a day (grep GCInspector
   /var/log/cassandra/system.log) : check if you have many pauses that take
   more than 1 second
   - GC logs at the time of a crash (if you don't produce any, you should
   activate them in cassandra-env.sh)
   - Tombstone warnings in the logs and high number of tombstone read in
   cfstats
   - Make sure swap is disabled


Cheers,


On Mon, Nov 21, 2016 at 2:57 PM Vincent Rischmann <m...@vrischmann.me> wrote:

@Vladimir

We tried with 12Gb and 16Gb, the problem appeared eventually too.
In this particular cluster we have 143 tables across 2 keyspaces.

@Alexander

We have one table with a max partition of 2.68GB, one of 256 MB, a bunch
with the size varying between 10MB to 100MB ~. Then there's the rest with
the max lower than 10MB.

On the biggest, the 99% is around 60MB, 98% around 25MB, 95% around 5.5MB.
On the one with max of 256MB, the 99% is around 4.6MB, 98% around 2MB.

Could the 1% here really have that much impact ? We do write a lot to the
biggest table and read quite often too, however I have no way to know if
that big partition is ever read.


On Mon, Nov 21, 2016, at 01:09 PM, Alexander Dejanovski wrote:

Hi Vincent,

one of the usual causes of OOMs is very large partitions.
Could you check your nodetool cfstats output in search of large partitions
? If you find one (or more), run nodetool cfhistograms on those tables to
get a view of the partition sizes distribution.

Thanks

On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com>
wrote:


Did you try any value in the range 8-20 (e.g. 60-70% of physical memory).
Also how many tables do you have across all keyspaces? Each table can
consume minimum 1M of Java heap.

Best regards, Vladimir Yudovin,

*Winguzone <https://winguzone.com?from=list> - Hosted Cloud CassandraLaunch
your cluster in minutes.*


 On Mon, 21 Nov 2016 05:13:12 -0500*Vincent Rischmann <m...@vrischmann.me
<m...@vrischmann.me>>* wrote 

Hello,

we have a 8 node Cassandra 2.1.15 cluster at work which is giving us a lot
of trouble lately.

The problem is simple: nodes regularly die because of an out of memory
exception or the Linux OOM killer decides to kill the process.
For a couple of weeks now we increased the heap to 20Gb hoping it would
solve the out of memory errors, but in fact it didn't; instead of getting
out of memory exception the OOM killer killed the JVM.

We reduced the heap on some nodes to 8Gb to see if it would work better,
but some nodes crashed again with out of memory exception.

I suspect some of our tables are badly modelled, which would cause
Cassandra to allocate a lot of data, however I don't know how to prove that
and/or find which table is bad, and which query is responsible.

I tried looking at metrics in JMX, and tried profiling using mission
control but it didn't really help; it's possible I missed it because I have
no idea what to look for exactly.

Anyone have some advice for troubleshooting this ?

Thanks.

-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Out of memory and/or OOM kill on a cluster

2016-11-21 Thread Alexander Dejanovski
Hi Vincent,

one of the usual causes of OOMs is very large partitions.
Could you check your nodetool cfstats output in search of large partitions
? If you find one (or more), run nodetool cfhistograms on those tables to
get a view of the partition sizes distribution.
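
For example (the keyspace and table names below are placeholders):

   # largest partition ever compacted for each table, in bytes
   nodetool cfstats my_keyspace | grep -E 'Table:|Compacted partition maximum bytes'
   # distribution of partition sizes for a suspicious table
   nodetool cfhistograms my_keyspace my_table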

Thanks

On Mon, Nov 21, 2016 at 12:01 PM Vladimir Yudovin <vla...@winguzone.com>
wrote:

> Did you try any value in the range 8-20 (e.g. 60-70% of physical memory).
> Also how many tables do you have across all keyspaces? Each table can
> consume minimum 1M of Java heap.
>
> Best regards, Vladimir Yudovin,
>
> *Winguzone <https://winguzone.com?from=list> - Hosted Cloud
> CassandraLaunch your cluster in minutes.*
>
>
>  On Mon, 21 Nov 2016 05:13:12 -0500*Vincent Rischmann
> <m...@vrischmann.me <m...@vrischmann.me>>* wrote 
>
> Hello,
>
> we have a 8 node Cassandra 2.1.15 cluster at work which is giving us a lot
> of trouble lately.
>
> The problem is simple: nodes regularly die because of an out of memory
> exception or the Linux OOM killer decides to kill the process.
> For a couple of weeks now we increased the heap to 20Gb hoping it would
> solve the out of memory errors, but in fact it didn't; instead of getting
> out of memory exception the OOM killer killed the JVM.
>
> We reduced the heap on some nodes to 8Gb to see if it would work better,
> but some nodes crashed again with out of memory exception.
>
> I suspect some of our tables are badly modelled, which would cause
> Cassandra to allocate a lot of data, however I don't know how to prove that
> and/or find which table is bad, and which query is responsible.
>
> I tried looking at metrics in JMX, and tried profiling using mission
> control but it didn't really help; it's possible I missed it because I have
> no idea what to look for exactly.
>
> Anyone have some advice for troubleshooting this ?
>
> Thanks.
>
> --
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


Re: Cassandra 3.6 Repair issue with Reaper

2016-11-14 Thread Alexander Dejanovski
Hi Abhishek,

Can you check if you're getting the same behavior on this cluster using
nodetool commands to start repair ? (don't forget to add --full in order to
make sure you're not running incremental repair, if that's indeed what
you're doing with reaper).
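
For example, something along these lines on one of the nodes (keyspace and
table names are placeholders), to compare with what Reaper triggers:

   nodetool repair --full -pr my_keyspace my_table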
Could you also open an issue on github for this, with the useful logs you
can get from your Cassandra nodes ?
https://github.com/thelastpickle/cassandra-reaper/issues

Thanks,

On Tue, Nov 15, 2016 at 7:00 AM Abhishek Aggarwal <
abhishek.aggarwa...@snapdeal.com> wrote:

> Hi All,
>
> we tried sequential repair on a very small table having only 20 rows using
> the reaper tool. But the repair got stuck while generating the snapshot.
>
>
> Similarly, when we tried parallel repair, the run was working fine in the
> beginning for a few segments but later it got stuck in compaction and
> never completed, and one of the nodes was shown as down to the other nodes
> due to a gossip issue, so we had to do a rolling restart of all the nodes.
>
>
> In both cases, 2 out of 6 nodes got stuck either in compaction or in
> generating the snapshot.
>
>
> Abhishek Aggarwal
>
> *Senior Software Engineer*
> *M*: +91 8861212073, 8588840304
> *T*: 0124 6600600 *EXT*: 12128
> ASF Center -A, ASF Center Udyog Vihar Phase IV,
>
-- 
-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

