Re: nodetool suddenly failing with "Access denied!"

2023-02-26 Thread Sergio
Hey!
I would try to spin up a new node and see if the problem occurs on it.
If it does, I would check the history of changes to the cookbook recipe.
If you don't find any problem on the new node, you might replace the
nodes having problems one by one with new ones and decommission the
affected ones.
It would cost some time and money, but that's better than having nodetool not
working.

Best,

Sergio

On Sun, Feb 26, 2023 at 10:51 AM Abe Ratnofsky wrote:

> Hey Mitch,
>
> The security upgrade schedule that your colleague is working on may well
> be relevant. Is your entire cluster on 3.11.6 or are the failing hosts
> possibly on a newer version?
>
> Abe
>
> On Feb 26, 2023, at 10:38, Mitch Gitman  wrote:
>
> 
>
> We're running Cassandra 3.11.6 on AWS EC2 instances. These clusters have
> been running for a few years.
>
>
> We're suddenly noticing now that on one of our clusters the nodetool
> command is failing on certain nodes but not on others.
>
>
> The failure:
>
> nodetool: Failed to connect to '...:7199' - SecurityException: 'Access
> denied! Invalid access level for requested MBeanServer operation.'.
>
>
> I suspect that this stems from some colleague I'm not in coordination with
> recently doing some security upgrades, but that's a bit of an academic
> matter for now.
>
>
> I've compared the jmxremote.access and jvm.options files on a host where
> nodetool is not working vs. a host where nodetool is working, and found no
> meaningful differences.
>
>
> Any ideas? The interesting aspect of this problem is that it is occurring
> on some nodes in the one cluster but not others.
>
>
> I'll update on this thread if I find any solutions on my end.
>
>
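For what it's worth, that SecurityException usually means the JMX role nodetool
authenticates as only has readonly access, while the requested MBean operation
needs readwrite. A minimal sketch of what to compare between a failing and a
working host, assuming file-based JMX authentication (the access-file path is a
placeholder taken from the JVM flag, not from this thread):

# see which JMX auth/access files the running JVM actually points at
tr '\0' '\n' < /proc/$(pgrep -f CassandraDaemon | head -1)/cmdline | grep jmxremote

# compare the access level granted to the role nodetool authenticates as
grep -v '^#' /path/to/jmxremote.access     # path comes from the flag above
# expect something like:
#   monitorRole    readonly
#   controlRole    readwrite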


Re: Send large blobs

2022-05-31 Thread Sergio
However, if I were you I would avoid that... I would rather store a URL to S3
or GFS in Cassandra.

Best,

Sergio

On Tue, May 31, 2022, 4:10 PM Sergio  wrote:

> You have to split it by yourself
> Best,
> Sergio
>
> On Tue, May 31, 2022, 3:56 PM Andria Trigeorgis 
> wrote:
>
>> Thank you for your prompt reply!
>> So, do I have to split the blob into chunks myself, or is there any
>> fragmentation mechanism in Cassandra?
>>
>>
>> On 31 May 2022, at 4:44 PM, Dor Laor  wrote:
>>
>> On Tue, May 31, 2022 at 4:40 PM Andria Trigeorgi 
>> wrote:
>>
>>> Hi,
>>>
>>> I want to write large blobs in Cassandra. However, when I tried to write
>>> more than a 256MB blob, I got the message:
>>> "Error from server: code=2200 [Invalid query] message=\"Request is too
>>> big: length 268435580 exceeds maximum allowed length 268435456.\"".
>>>
>>> I tried to change the variables "max_value_size_in_mb" and "
>>> native_transport_max_frame_size_in_mb" of the file "
>>> /etc/cassandra/cassandra.yaml" to 512, but I got a
>>> ConnectionRefusedError error. What am I doing wrong?
>>>
>>
>> You sent a large blob ;)
>>
>> This limitation exists to protect you as a user.
>> The DB can store such blobs but it will incur a large and unexpected
>> latency, not just
>> for the query but also for under-the-hood operations, like backup and
>> repair.
>>
>> Best is not to store such large blobs in Cassandra, or to chop them into
>> smaller units, say 10MB pieces, and re-assemble them in the app.
>>
>>
>>>
>>> Thank you in advance,
>>>
>>> Andria
>>>
>>
>>
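For anyone looking for a concrete starting point, a minimal sketch of the
chunking approach described above, assuming a keyspace named my_ks and ~10MB
pieces (table, column and file names are illustrative, not from this thread):

cqlsh -e "
CREATE TABLE IF NOT EXISTS my_ks.blob_chunks (
    blob_id   uuid,
    chunk_no  int,
    data      blob,
    PRIMARY KEY (blob_id, chunk_no)
);"

# split the payload into ~10MB pieces; the application then writes one row per
# piece (blob_id, chunk_no, data) and re-assembles them in order on read
split -b 10M big_file.bin chunk_
ls chunk_* | wc -l    # number of chunk_no values that will be written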


Re: Send large blobs

2022-05-31 Thread Sergio
You have to split it by yourself
Best,
Sergio

On Tue, May 31, 2022, 3:56 PM Andria Trigeorgis 
wrote:

> Thank you for your prompt reply!
> So, do I have to split the blob into chunks myself, or is there any
> fragmentation mechanism in Cassandra?
>
>
> On 31 May 2022, at 4:44 PM, Dor Laor  wrote:
>
> On Tue, May 31, 2022 at 4:40 PM Andria Trigeorgi 
> wrote:
>
>> Hi,
>>
>> I want to write large blobs in Cassandra. However, when I tried to write
>> more than a 256MB blob, I got the message:
>> "Error from server: code=2200 [Invalid query] message=\"Request is too
>> big: length 268435580 exceeds maximum allowed length 268435456.\"".
>>
>> I tried to change the variables "max_value_size_in_mb" and "
>> native_transport_max_frame_size_in_mb" of the file "
>> /etc/cassandra/cassandra.yaml" to 512, but I got a
>> ConnectionRefusedError error. What am I doing wrong?
>>
>
> You sent a large blob ;)
>
> This limitation exists to protect you as a user.
> The DB can store such blobs but it will incur a large and unexpected
> latency, not just
> for the query but also for under-the-hood operations, like backup and
> repair.
>
> Best is not to store such large blobs in Cassandra, or to chop them into
> smaller units, say 10MB pieces, and re-assemble them in the app.
>
>
>>
>> Thank you in advance,
>>
>> Andria
>>
>
>


Re: Running Large Clusters in Production

2020-07-10 Thread Sergio
Sorry for the dumb question:

When we refer to 1000 nodes divided into 10 clusters (shards), we would have
100 nodes per cluster.
A shard is not intended as a datacenter but would be a cluster in itself that
doesn't talk to the other ones, so there would have to be some routing logic
at the application level to route the requests to the correct cluster?
Is this the recommended approach?

Thanks



On Fri, Jul 10, 2020, 4:06 PM Jon Haddad  wrote:

> I worked on a handful of large clusters (> 200 nodes) using vnodes, and
> there were some serious issues with both performance and availability.  We
> had to put in a LOT of work to fix the problems.
>
> I agree with Jeff - it's way better to manage multiple clusters than a
> really large one.
>
>
> On Fri, Jul 10, 2020 at 2:49 PM Jeff Jirsa  wrote:
>
>> 1000 instances are fine if you're not using vnodes.
>>
>> I'm not sure what the limit is if you're using vnodes.
>>
>> If you might get to 1000, shard early before you get there. Running 8x100
>> host clusters will be easier than one 800 host cluster.
>>
>>
>> On Fri, Jul 10, 2020 at 2:19 PM Isaac Reath (BLOOMBERG/ 919 3RD A) <
>> ire...@bloomberg.net> wrote:
>>
>>> Hi All,
>>>
>>> I’m currently dealing with a use case that is running on around 200
>>> nodes. Due to growth of their product as well as onboarding additional data
>>> sources, we are looking at having to expand that to around 700 nodes, and
>>> potentially beyond to 1000+. To that end I have a couple of questions:
>>>
>>> 1) For those who have experienced managing clusters at that scale, what
>>> types of operational challenges have you run into that you might not see
>>> when operating 100 node clusters? A couple that come to mind: version
>>> (especially major version) upgrades become a lot riskier, as it is no longer
>>> feasible to do a blue/green style deployment of the database, and backup &
>>> restore operations seem far more error prone as well for the same reason
>>> (having to do an in-place restore instead of being able to spin up a
>>> new cluster to restore to).
>>>
>>> 2) Is there a cluster size beyond which sharding across multiple
>>> clusters becomes the recommended approach?
>>>
>>> Thanks,
>>> Isaac
>>>
>>>


Re: Nodetool clearsnapshot does not delete snapshot for dropped column_family

2020-04-30 Thread Sergio
The problem is that the folder is not under snapshots; it is under the data
path.
I tried with the --all switch too.
Thanks,
Sergio

On Thu, Apr 30, 2020, 4:21 PM Nitan Kainth  wrote:

> I don't think it works like that. clearsnapshot --all would remove all
> snapshots. Here is an example:
>
> $ ls -l
> /ss/xx/cassandra/data/ww/a-5bf825428b3811eabe0c6b7631a60bb0/snapshots/
>
> total 8
>
> drwxr-xr-x 2 cassandra cassandra 4096 Apr 30 23:17 dropped-1588288650821-a
>
> drwxr-xr-x 2 cassandra cassandra 4096 Apr 30 23:17 manual
>
> $ nodetool clearsnapshot --all
>
> Requested clearing snapshot(s) for [all keyspaces] with [all snapshots]
>
> $ ls -l
> /ss/xx/cassandra/data/ww/a-5bf825428b3811eabe0c6b7631a60bb0/snapshots/
>
> ls: cannot access
> /ss/xx/cassandra/data/ww/a-5bf825428b3811eabe0c6b7631a60bb0/snapshots/:
> No such file or directory
>
> $
>
>
> On Thu, Apr 30, 2020 at 5:44 PM Erick Ramirez 
> wrote:
>
>> Yes, you're right. It doesn't show up in listsnapshots, nor does
>> clearsnapshot remove the dropped snapshot, because the table is no longer
>> managed by C* (since it got dropped). So you will need to manually remove
>> the dropped-* directories from the filesystem.
>>
>> Someone here will either correct me or hopefully provide a
>> user-friendlier solution. Cheers!
>>
>
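A minimal sketch of that manual cleanup, assuming the default data path (adjust
to data_file_directories in cassandra.yaml, and verify the list before deleting
anything):

DATA_DIR=/var/lib/cassandra/data

# snapshots created by auto_snapshot when a table is dropped are named dropped-*
find "$DATA_DIR" -maxdepth 4 -type d -path '*/snapshots/dropped-*'

# once every listed directory is confirmed to belong to a table you really dropped:
# find "$DATA_DIR" -maxdepth 4 -type d -path '*/snapshots/dropped-*' -exec rm -rf {} +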


Nodetool clearsnapshot does not delete snapshot for dropped column_family

2020-04-30 Thread Sergio Bilello
Hi guys!
I am running Cassandra 3.11.4. I dropped a column_family, but I can still see
the disk space occupied by that column_family on disk. I understand that
since I have the auto_snapshot flag = true this behavior is expected.
However, I would like to avoid writing a dummy script that removes the
column_family folder on each node.
I tried the nodetool clearsnapshot command but it didn't work, and when I run
nodetool listsnapshots I don't see anything. It is as if that occupied space
is hidden.

Any suggestion?

Thanks,

Sergio




Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-25 Thread Sergio
Hi Erick!

Just following up on your statement:

Limiting the seeds to 2 per DC means:

A) each node in a DC has at least 2 seeds and those seeds belong to the
same DC,
or
B) each node in a DC has at least 2 seeds, even across different DCs?


Thanks,

Sergio


On Thu, Feb 13, 2020 at 7:46 PM Erick Ramirez <
erick.rami...@datastax.com> wrote:

> Not a problem. And I've just responded on the new thread. Cheers! 
>
>>


Re: IN OPERATOR VS BATCH QUERY

2020-02-20 Thread Sergio
The current approach is DELETE FROM key_value WHERE id = whatever, and it is
performed asynchronously from the client.
I was thinking of at least reducing the network round-trips between the client
and the coordinator with that batch approach. :)

In any case, I would test whether it improves things or not. So when do you
use batches, then?

Best,

Sergio

On Thu, Feb 20, 2020, 6:18 PM Erick Ramirez 
wrote:

> Batches aren't really meant for optimisation in the same way as RDBMS. If
> anything, it will just put pressure on the coordinator having to fire off
> multiple requests to lots of replicas. The IN operator falls into the same
> category and I personally wouldn't use it with more than 2 or 3 partitions
> because then the coordinator will suffer from the same problem.
>
> If it were me, I'd just issue single-partition deletes and throttle it to
> a "reasonable" throughput that your cluster can handle. The word
> "reasonable" is in quotes because only you can determine that magic number
> for your cluster through testing. Cheers!
>
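A minimal sketch of the throttled single-partition approach, reusing the
key/value naming from this thread (the keys file, keyspace name and sleep
interval are illustrative; a real application would use the driver's async API
with a rate limiter rather than spawning cqlsh per key):

while read -r key; do
  cqlsh -e "DELETE FROM my_ks.key_value WHERE id = '${key}';"
  sleep 0.05    # crude throttle; tune to what the cluster can absorb
done < keys_to_delete.txt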


IN OPERATOR VS BATCH QUERY

2020-02-20 Thread Sergio Bilello
Hi guys!

Let's say we have a key-value schema.

The goal is to delete the keys in batches without burning the cluster, and to
be as efficient as possible.

I would like to know if it is better to run the query as DELETE FROM
KEY_VALUE_COLUMN_FAMILY WHERE KEY IN ('A','B','C'); with at most 10 keys in
the IN statement,

OR

to handle it with a Cassandra batch query; in particular, I was looking at
https://docs.spring.io/spring-data/cassandra/docs/current/api/org/springframework/data/cassandra/core/ReactiveCassandraBatchOperations.html#delete-java.lang.Iterable-

Thanks,

Sergio







Re: AWS I3.XLARGE retiring instances advices

2020-02-16 Thread Sergio
I really like these conversations, so feel free to continue this one or
create a new one. Thanks to everyone participating :)


On Sun, Feb 16, 2020 at 2:04 PM Reid Pinchback <
rpinchb...@tripadvisor.com> wrote:

> No actually in this case I didn’t really have an opinion because C* is an
> architecturally different beast than an RDBMS.  That’s kinda what ticked
> the curiosity when you made the suggestion about co-locating commit and
> data.  It raises an interesting question for me.  As for the 10 seconds
> delay, I’m used to looking at graphite, so bad is relative. 
>
>
>
> The question that pops to mind is this. If a commit log isn’t really an
> important recovery mechanism…. should one even be part of C* at all?  It’s
> a lot of code complexity and I/O volume and O/S tuning complexity to worry
> about having good I/O resiliency and performance with both commit and data
> volumes.
>
>
>
> If the proper way to deal with all data volume problems in C* would be to
> burn the node (or at least, it’s state) and rebuild via the state of its
> neighbours, then repairs (whether administratively triggered, or as a
> side-effect of ongoing operations) should always catch up with any
> mutations anyways so long as the data is appropriately replicated.  The
> benefit to having a commit log would seem limited to data which isn't
> replicated.
>
>
>
> However, I shouldn’t derail Sergio’s thread.  It just was something that
> caught my interest and got me mulling, but it’s a tangent.
>
>
>
> *From: *Erick Ramirez 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Friday, February 14, 2020 at 9:04 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: AWS I3.XLARGE retiring instances advices
>
>
>
> *Message from External Sender*
>
> Erick, a question purely as a point of curiosity.  The entire model of a
> commit log, historically (speaking in RDBS terms), depended on a notion of
> stable store. The idea being that if your data volume lost recent writes,
> the failure mode there would be independent of writes to the volume holding
> the commit log, so that replay of the commit log could generally be
> depended on to recover the missing data.  I’d be curious what the C* expert
> viewpoint on that would be, with the commit log and data on the same volume.
>
>
>
> Those are fair points so thanks for bringing them up. I'll comment from a
> personal viewpoint and others can provide their opinions/feedback.
>
>
>
> If you think about it, you've lost the data volume -- not just the recent
> writes. Replaying the mutations in the commit log is probably insignificant
> compared to having to recover the data through various ways (re-bootstrap,
> refresh from off-volume/off-server snapshots, etc). The data and
> redo/archive logs being on the same volume (in my opinion) is more relevant
> in RDBMS since they're mostly deployed on SANs compared to the
> nothing-shared architecture of C*. I know that's debatable and others will
> have their own view. :)
>
>
>
> How about you, Reid? Do you have concerns about both data and commitlog
> being on the same disk? And slightly off-topic but by extension, do you
> also have concerns about the default commitlog fsync() being 10 seconds?
> Cheers!
>


Re: AWS I3.XLARGE retiring instances advices

2020-02-13 Thread Sergio
Thank you for the advice!

Best!

Sergio

On Thu, Feb 13, 2020, 7:44 PM Erick Ramirez 
wrote:

> Option 1 is a cheaper option because the cluster doesn't need to rebalance
> (with the loss of a replica) post-decommission then rebalance again when
> you add a new node.
>
> The hints directory on EBS is irrelevant because it would only contain
> mutations to replay to down replicas if the node was a coordinator. In the
> scenario where the node itself goes down, other nodes will be storing hints
> for this down node. The saved_caches are also useless if you're
> bootstrapping the node into the cluster because the cache entries are only
> valid for the previous data files, not the newly streamed files from the
> bootstrap. Similarly, your commitlog directory will be empty -- that's
> the whole point of running nodetool drain. :)
>
> A little off-topic but *personally* I would co-locate the commitlog on
> the same 950GB NVMe SSD as the data files. You would get a much better
> write performance from the nodes compared to EBS and they shouldn't hurt
> your reads since the NVMe disks have very high IOPS. I think they can
> sustain 400K+ IOPS (don't quote me). I'm sure others will comment if they
> have a different experience. And of course, YMMV. Cheers!
>
>
>
> On Fri, 14 Feb 2020 at 14:16, Sergio  wrote:
>
>> We have i3xlarge instances with data directory in the XFS filesystem that
>> is ephemeral and *hints*, *commit_log* and *saved_caches* in the EBS
>> volume.
>> Whenever AWS is going to retire the instance due to degraded hardware
>> performance is it better:
>>
>> Option 1)
>>- Nodetool drain
>>- Stop cassandra
>>- Restart the machine from aws-cli to be restored in a different VM
>> from the hypervisor
>>- Start Cassandra with -Dcassandra.replace_address
>>- We lose only the ephemeral but the commit_logs, hints, saved_cache
>> will be there
>>
>>
>> OR
>>
>> Option 2)
>>  - Add a new node and wait for the NORMAL status
>>  - Decommission the one that is going to be retired
>>  - Run cleanup with cstar across the datacenters
>>
>> ?
>>
>> Thanks,
>>
>> Sergio
>>
>
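A minimal sketch of option 1, with hints/commitlog/saved_caches surviving on
EBS across the stop/start (the instance ID, paths and service names are
placeholders):

nodetool drain
sudo systemctl stop cassandra

# stop/start (not reboot) from the AWS CLI so the instance is moved off the
# degraded hardware; the i3 instance-store data is lost, the EBS volume is reattached
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# before starting Cassandra on the replacement hardware, point it at its old identity
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<this_nodes_old_ip>"' \
  | sudo tee -a /etc/cassandra/conf/cassandra-env.sh
sudo systemctl start cassandra
# remove the replace_address flag once nodetool status shows the node as UN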


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
Thank you very much for this helpful information!

I opened a new thread for the other question :)

Sergio

On Thu, Feb 13, 2020 at 7:22 PM Erick Ramirez <
erick.rami...@datastax.com> wrote:

> I want to have more than one seed node in each DC, so unless I restart
>> the node after changing the seed_list on that node, it will not
>> become a seed.
>
>
> That's not really going to hurt you if you have other seeds in other DCs.
> But if you're willing to take the hit from the restart then feel free to do
> so. Just saying that it's not necessary to do it immediately so the option
> is there for you. :)
>
>
> Do I need to update the seed_list across all the nodes even in separate
>> DCs and perform a rolling restart even across DCs, or should the restart
>> happen only on the new node that I want as a seed?
>
>
> You generally want to make the seeds list the same across all nodes in the
> cluster. You want to avoid the situation where lots of nodes are used as
> seeds by various nodes. Limiting the seeds to 2 per DC means that gossip
> convergence will happen much faster. Cheers!
>
>>
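For reference, a quick way to verify that every node carries the same seed list
with the 2-per-DC layout described above (the config path and IPs are
placeholders):

grep -n 'seeds:' /etc/cassandra/conf/cassandra.yaml
# expected to be identical on every node, e.g. two seeds in each of two DCs:
#           - seeds: "10.1.0.10,10.1.0.11,10.2.0.10,10.2.0.11"

# run across hosts, e.g.:
for h in node1 node2 node3; do ssh "$h" grep 'seeds:' /etc/cassandra/conf/cassandra.yaml; done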


AWS I3.XLARGE retiring instances advices

2020-02-13 Thread Sergio
We have i3xlarge instances with the data directory on the ephemeral XFS
filesystem and *hints*, *commit_log* and *saved_caches* on the EBS
volume.
Whenever AWS is going to retire an instance due to degraded hardware
performance, is it better to:

Option 1)
   - Nodetool drain
   - Stop cassandra
   - Restart the machine from aws-cli to be restored in a different VM from
the hypervisor
   - Start Cassandra with -Dcassandra.replace_address
   - We lose only the ephemeral but the commit_logs, hints, saved_cache
will be there


OR

Option 2)
 - Add a new node and wait for the NORMAL status
 - Decommission the one that is going to be retired
 - Run cleanup with cstar across the datacenters

?

Thanks,

Sergio


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
Right now, yes, I have one seed per DC.

I want to have more than one seed node in each DC, so unless I restart the
node after changing the seed_list on that node, it will not become a seed.

Do I need to update the seed_list across all the nodes even in separate DCs
and perform a rolling restart even across DCs, or should the restart happen
only on the new node that I want as a seed?

The reason is that each datacenter currently has:
a seed from the DC it belongs to and a seed from the other DC.

Thanks,

Sergio


On Thu, Feb 13, 2020 at 6:41 PM Erick Ramirez <
erick.rami...@datastax.com> wrote:

> 1) If I don't restart the node after changing the seed list, it will
>> never become a seed, and I would like to be sure that I don't find myself
>> in a spot where I don't have seed nodes, which would mean that I cannot add
>> a node to the cluster.
>
>
> Are you saying you only have 1 seed node in the seeds list of each node?
> We recommend 2 nodes per DC as seeds -- if one node is down, there's still
> another node in the local DC to contact. In the worst case scenario where 2
> nodes in the local DC are down, then nodes can contact seeds in other DCs.
>
> For the second item, could I make a small request? Since it's unrelated to
> this thread, would you mind starting up a new email thread? It just makes
> it easier for other users to follow the threads in the future if they're
> searching for answers to similar questions. Cheers!
>
>>


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
Thank you very much for your response!

2 things:

1) If I don't restart the node after changing the seed list, it will never
become a seed, and I would like to be sure that I don't find myself in a
spot where I don't have seed nodes, which would mean that I cannot add a node
to the cluster.

2) We have i3xlarge instances with the data directory on the ephemeral XFS
filesystem and hints, commit_log and saved_caches on the EBS volume.
Whenever AWS is going to retire an instance due to degraded hardware
performance, is it better to:

Option 1)
   - Nodetool drain
   - Stop cassandra
   - Restart the machine from aws to be restored in a different VM from the
hypervisor
   - Start Cassandra with -Dcassandra.replace_address

OR
Option 2)
 - Add a new node and wait for the NORMAL status
 - Decommission the one that is going to be retired
 - Run cleanup with cstar across the datacenters

?

Thanks,

Sergio




On Thu, Feb 13, 2020 at 6:15 PM Erick Ramirez <
erick.rami...@datastax.com> wrote:

> I did decommission of this node and I did all the steps mentioned except
>> the -Dcassandra.replace_address and now it is streaming correctly!
>
>
> That works too but I was trying to avoid the rebalance operations (like
> streaming to restore replica counts) since they can be expensive.
>
> So basically, if I want this new node as seed should I add its IP address
>> after it joined the cluster and after
>> - nodetool drain
>> - restart cassandra?
>
>
> There's no need to restart C* after updating the seeds list. It will just
> take effect the next time you restart.
>
> I deactivated the future repair happening in the cluster while this node
>> is joining.
>> When you add a node is it better to stop the repair process?
>
>
> It's not necessary to do so if you have sufficient capacity in your
> cluster. Topology changes are just a normal part of a C* cluster's
> operation just like repairs. But when you temporarily disable repairs,
> existing nodes have more capacity to bootstrap a new node so there is a
> benefit there. Cheers!
>
>>


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
I decommissioned this node and did all the steps mentioned except
-Dcassandra.replace_address, and now it is streaming correctly!

So basically, if I want this new node as a seed, should I add its IP address
to the seed list after it has joined the cluster and after
- nodetool drain
- restart cassandra?

I deactivated the upcoming repairs in the cluster while this node is
joining.

When you add a node, is it better to stop the repair process?

Thank you very much Erick!

Best,

Sergio


On Thu, Feb 13, 2020 at 5:52 PM Erick Ramirez <
erick.rami...@datastax.com> wrote:

> Should I do something to fix it or leave as it?
>
>
> It depends on what your intentions are. I would use the "replace" method
> to build it correctly. At a high level:
> - remove the IP from it's own seeds list
> - delete the contents of data, commitlog and saved_caches
> - add the replace flag in cassandra-env.sh (
> -Dcassandra.replace_address=its_own_ip)
> - start C*
>
> That should allow the node to "replace itself" in the ring and prevent
> expensive reshuffling/rebalancing of tokens. Cheers!
>
>>
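A minimal sketch of those replace-itself steps, assuming a package install with
default paths (adjust to the directories configured in cassandra.yaml):

sudo systemctl stop cassandra

# 1. make sure this node's own IP is NOT in the seeds list in cassandra.yaml
# 2. wipe its local state
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*

# 3. have it replace itself in the ring
echo 'JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address=<its_own_ip>"' \
  | sudo tee -a /etc/cassandra/conf/cassandra-env.sh

sudo systemctl start cassandra
# remove the replace_address flag again after the node finishes streaming and shows as UN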


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
Thanks for your fast reply!

No repairs are running!

https://cassandra.apache.org/doc/latest/faq/index.html#does-single-seed-mean-single-point-of-failure

I added the node's own IP and the IPs of existing seeds to its seed list and
I started Cassandra.

So the right procedure is not to put the new node itself in the seed list
alongside an already existing seed node and then start Cassandra?

What should I do? I am running nodetool netstats and the streams are
happening from other nodes.

Thanks


On Thu, Feb 13, 2020 at 5:39 PM Erick Ramirez <
erick.rami...@datastax.com> wrote:

> I wanted to add a new node in the cluster and it looks to be working fine
>> but instead to wait for 2-3 hours data streaming like 100GB it immediately
>> went to the UN (UP and NORMAL) state.
>>
>
> Are you running a repair? I can't see how it's possibly receiving 100GB
> since it won't bootstrap.
>


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
Should I do something to fix it or leave it as is?

On Thu, Feb 13, 2020, 5:29 PM Jon Haddad  wrote:

> Seeds don't bootstrap, don't list new nodes as seeds.
>
> On Thu, Feb 13, 2020 at 5:23 PM Sergio  wrote:
>
>> Hi guys!
>>
>> I don't know how but this is the first time that I see such behavior. I
>> wanted to add a new node in the cluster and it looks to be working fine but
>> instead to wait for 2-3 hours data streaming like 100GB it immediately went
>> to the UN (UP and NORMAL) state.
>>
>> I saw a bunch of exception in the logs and WARN
>>  [MessagingService-Incoming-/10.1.17.126] 2020-02-14 01:08:07,812
>> IncomingTcpConnection.java:103 - UnknownColumnFamilyException reading from
>> socket; closing
>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table
>> for cfId a5af88d0-24f6-11e9-b009-95ed77b72f6e. If a table was just created,
>> this is likely due to the schema not being fully propagated.  Please wait
>> for schema agreement on table creation.
>> at
>> org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1525)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:850)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:825)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>> at
>> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]
>>
>> but in the end, it is working...
>>
>> Suggestion?
>>
>> Thanks,
>>
>> Sergio
>>
>


New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Sergio
Hi guys!

I don't know how, but this is the first time that I have seen such behavior. I
wanted to add a new node to the cluster and it looks to be working fine, but
instead of waiting 2-3 hours for data streaming (roughly 100GB), it immediately
went to the UN (Up and Normal) state.

I saw a bunch of exceptions in the logs, such as: WARN
 [MessagingService-Incoming-/10.1.17.126] 2020-02-14 01:08:07,812
IncomingTcpConnection.java:103 - UnknownColumnFamilyException reading from
socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table
for cfId a5af88d0-24f6-11e9-b009-95ed77b72f6e. If a table was just created,
this is likely due to the schema not being fully propagated.  Please wait
for schema agreement on table creation.
at
org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1525)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:850)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:825)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371)
~[apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
~[apache-cassandra-3.11.5.jar:3.11.5]
at
org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
~[apache-cassandra-3.11.5.jar:3.11.5]

but in the end, it is working...

Suggestion?

Thanks,

Sergio


Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-13 Thread Sergio
   - Verify that nodetool upgradesstables has completed successfully on all
   nodes from any previous upgrade
   - Turn off repairs and any other streaming operations (add/remove nodes)
   - Nodetool drain on the node that needs to be stopped (seeds first,
   preferably)
   - Stop an un-upgraded node (seeds first, preferably)
   - Install new binaries and configs on the down node
   - Restart that node and make sure it comes up clean (it will function
   normally in the cluster – even with mixed versions)
   - nodetool statusbinary to verify if it is up and running
   - Repeat for all nodes
   - Once the binary upgrade has been performed in all the nodes: Run
   upgradesstables on each node (as many at a time as your load will allow).
   Minor upgrades usually don’t require this step (only if the sstable format
   has changed), but it is good to check.
   - NOTE: in most cases applications can keep running and will not notice
   much impact – unless the cluster is overloaded and a single node down
   causes impact.



   I added 2 points to the list to clarify.

   Should we add this to a FAQ in the Cassandra docs or to Awesome
   Cassandra (https://cassandra.link/awesome/)?

   Thanks,

   Sergio
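A minimal per-node sketch of the checklist above (service and package names are
placeholders and depend on how Cassandra was installed):

nodetool drain
sudo systemctl stop cassandra
sudo yum install -y cassandra-3.11.5     # or the equivalent for your package manager; re-apply config overrides
sudo systemctl start cassandra
nodetool statusbinary                    # expect: running
nodetool status                          # the node should come back as UN

# only after every node runs the new binary, and only if the sstable format changed:
nodetool upgradesstables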


On Wed, Feb 12, 2020 at 10:58 AM Durity, Sean R <
sean_r_dur...@homedepot.com> wrote:

> Check the readme.txt for any upgrade notes, but the basic procedure is to:
>
>- Verify that nodetool upgradesstables has completed successfully on
>all nodes from any previous upgrade
>- Turn off repairs and any other streaming operations (add/remove
>nodes)
>- Stop an un-upgraded node (seeds first, preferably)
>- Install new binaries and configs on the down node
>- Restart that node and make sure it comes up clean (it will function
>normally in the cluster – even with mixed versions)
>- Repeat for all nodes
>- Run upgradesstables on each node (as many at a time as your load
>will allow). Minor upgrades usually don’t require this step (only if the
>sstable format has changed), but it is good to check.
>- NOTE: in most cases applications can keep running and will not
>notice much impact – unless the cluster is overloaded and a single node
>down causes impact.
>
>
>
>
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
>
>
> *From:* Sergio 
> *Sent:* Wednesday, February 12, 2020 11:36 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Cassandra 3.11.X upgrades
>
>
>
> Hi guys!
>
> How do you usually upgrade your cluster for minor version upgrades?
>
> I tried to add a node with 3.11.5 version to a test cluster with 3.11.4
> nodes.
>
> Is there any restriction?
>
> Best,
>
> Sergio
>
>


Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-12 Thread Sergio
So I should follow the steps above, right?
Thanks, Erick!

On Wed, Feb 12, 2020, 6:58 PM Erick Ramirez 
wrote:

> In case you have an hybrid situation with 3.11.3 , 3.11.4 and 3.11.5 that
>> it is working and it is in production what do you recommend?
>
>
> You shouldn't end up in this mixed-version situation at all. I would
> highly recommend you upgrade all the nodes to 3.11.5 or whatever the latest
> version is installed on the nodes. Mixed-versions isn't a tested or
> supported scenario and the cluster's behaviour can be unpredictable. The
> behaviour might not be catastrophic but you don't want to be the one who
> discovers some exotic bug that arises out of that configuration. Cheers!
>
>>


Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-12 Thread Sergio
Thanks everyone!

In case you have a hybrid situation with 3.11.3, 3.11.4 and 3.11.5 that is
working and is in production, what do you recommend?



On Wed, Feb 12, 2020, 5:55 PM Erick Ramirez 
wrote:

> So unless the sstable format has not been changed I can avoid to do that.
>
>
> Just to reinforce what Jon and Sean already said, the above assumption is
> dangerous. It is always best to follow the recommended upgrade procedure
> and mixed-versions is never a good idea unless you've received instructions
> from a qualified source to address a specific issue. But as Jon said, we
> wouldn't be on this mailing list otherwise. 
>
> Erick Ramirez  |  Developer Relations
>
> erick.rami...@datastax.com | datastax.com 
> 
>  
>  
>
> 
>
>>


Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-12 Thread Sergio
Thanks, everyone! @Jon
https://lists.apache.org/thread.html/rd18814bfba487824ca95a58191f4dcdb86f15c9bb66cf2bcc29ddf0b%40%3Cuser.cassandra.apache.org%3E

I have a side question about something that seems controversial given the
response from Anthony.
So is it safe to go to production on a 1TB cluster with vnodes = 4?
Do we need to follow these steps:
https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html?
What I got from Anthony's response is that this is just an example and vnodes =
4 is not ready for production.
https://lists.apache.org/thread.html/r21cd99fa269076d186a82a8b466eb925681373302dd7aa6bb26e5bde%40%3Cuser.cassandra.apache.org%3E

Best,

Sergio



On Wed, Feb 12, 2020 at 11:42 AM Durity, Sean R <
sean_r_dur...@homedepot.com> wrote:

> >>A while ago, on my first cluster
>
>
>
> Understatement used so effectively. Jon is a master.
>
>
>
>
>
>
>
> On Wed, Feb 12, 2020 at 11:02 AM Sergio  wrote:
>
> Thanks for your reply!
>
>
>
> So unless the sstable format has not been changed I can avoid to do that.
>
>
>
> Correct?
>
>
>
> Best,
>
>
>
> Sergio
>
>
>
> On Wed, Feb 12, 2020, 10:58 AM Durity, Sean R 
> wrote:
>
> Check the readme.txt for any upgrade notes, but the basic procedure is to:
>
>- Verify that nodetool upgradesstables has completed successfully on
>all nodes from any previous upgrade
>- Turn off repairs and any other streaming operations (add/remove
>nodes)
>- Stop an un-upgraded node (seeds first, preferably)
>- Install new binaries and configs on the down node
>- Restart that node and make sure it comes up clean (it will function
>normally in the cluster – even with mixed versions)
>- Repeat for all nodes
>- Run upgradesstables on each node (as many at a time as your load
>will allow). Minor upgrades usually don’t require this step (only if the
>sstable format has changed), but it is good to check.
>- NOTE: in most cases applications can keep running and will not
>notice much impact – unless the cluster is overloaded and a single node
>down causes impact.
>
>
>
>
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
>
>
> *From:* Sergio 
> *Sent:* Wednesday, February 12, 2020 11:36 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Cassandra 3.11.X upgrades
>
>
>
> Hi guys!
>
> How do you usually upgrade your cluster for minor version upgrades?
>
> I tried to add a node with 3.11.5 version to a test cluster with 3.11.4
> nodes.
>
> Is there any restriction?
>
> Best,
>
> Sergio
>
>
>


Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-12 Thread Sergio
Thanks for your reply!

So unless the sstable format has been changed, I can avoid doing that.

Correct?

Best,

Sergio

On Wed, Feb 12, 2020, 10:58 AM Durity, Sean R 
wrote:

> Check the readme.txt for any upgrade notes, but the basic procedure is to:
>
>- Verify that nodetool upgradesstables has completed successfully on
>all nodes from any previous upgrade
>- Turn off repairs and any other streaming operations (add/remove
>nodes)
>- Stop an un-upgraded node (seeds first, preferably)
>- Install new binaries and configs on the down node
>- Restart that node and make sure it comes up clean (it will function
>normally in the cluster – even with mixed versions)
>- Repeat for all nodes
>- Run upgradesstables on each node (as many at a time as your load
>will allow). Minor upgrades usually don’t require this step (only if the
>sstable format has changed), but it is good to check.
>- NOTE: in most cases applications can keep running and will not
>notice much impact – unless the cluster is overloaded and a single node
>down causes impact.
>
>
>
>
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
>
>
> *From:* Sergio 
> *Sent:* Wednesday, February 12, 2020 11:36 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Cassandra 3.11.X upgrades
>
>
>
> Hi guys!
>
> How do you usually upgrade your cluster for minor version upgrades?
>
> I tried to add a node with 3.11.5 version to a test cluster with 3.11.4
> nodes.
>
> Is there any restriction?
>
> Best,
>
> Sergio
>
>


Re: How to elect a normal node to a seed node

2020-02-12 Thread Sergio
So if
1) I stop a Cassandra node that doesn't have itself in its seeds IP list,
2) I change the cassandra.yaml of this node and add it to the seed list,
3) I restart the node,

it will work completely fine, and this is not even necessary.

This means that, from the client driver perspective, when I define the
contact points I can specify any node in the cluster as a contact point and
not necessarily a seed node?

Best,

Sergio


On Wed, Feb 12, 2020, 9:08 AM Arvinder Dhillon 
wrote:

> I believe seed nodes are not special nodes; it's just that you choose a
> few nodes from the cluster that help to bootstrap new joining nodes. You can
> change cassandra.yaml to make any other node a seed node. There's no such
> thing as a promotion.
>
> -Arvinder
>
> On Wed, Feb 12, 2020, 8:37 AM Sergio  wrote:
>
>> Hi guys!
>>
>> Is there a way to promote a not seed node to a seed node?
>>
>> If yes, how do you do it?
>>
>> Thanks!
>>
>


How to elect a normal node to a seed node

2020-02-12 Thread Sergio
Hi guys!

Is there a way to promote a non-seed node to a seed node?

If yes, how do you do it?

Thanks!


Cassandra 3.11.X upgrades

2020-02-12 Thread Sergio
Hi guys!

How do you usually upgrade your cluster for minor version upgrades?

I tried to add a node with 3.11.5 version to a test cluster with 3.11.4
nodes.

Is there any restriction?

Best,

Sergio


Re: [EXTERNAL] How to reduce vnodes without downtime

2020-02-11 Thread Sergio
Have you had a chance to take a look at this one?

On Mon, Feb 3, 2020 at 11:36 PM Sergio wrote:

> After reading this
>
> *I would only consider moving a cluster to 4 tokens if it is larger than
> 100 nodes. If you read through the paper that Erick mentioned, written
> by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the
> availability of large scale clusters.*
>
> and
>
> With 16 tokens, that is vastly improved, but you still have up to 64 nodes
> each node needs to query against, so you're again, hitting every node
> unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I
> wouldn't use 16 here, and I doubt any of you would either.  I've advocated
> for 4 tokens because you'd have overlap with only 16 nodes, which works
> well for small clusters as well as large.  Assuming I was creating a new
> cluster for myself (in a hypothetical brand new application I'm building) I
> would put this in production.  I have worked with several teams where I
> helped them put 4 token clusters in prod and it has worked very well.  We
> didn't see any wild imbalance issues.
>
> from
> https://lists.apache.org/thread.html/r55d8e68483aea30010a4162ae94e92bc63ed74d486e6c642ee66f6ae%40%3Cuser.cassandra.apache.org%3E
>
> Sorry guys, but I am kinda confused now which should be the recommended
> approach for the number of *vnodes*.
> Right now I am handling a cluster with just 9 nodes and a data size of
> 100-200GB per node.
>
> I am seeing some unbalancing and I was worried because I have 256 vnodes
>
> --  Address  Load   Tokens   OwnsHost ID
> Rack
> UN  10.1.30.112  115.88 GiB  256  ?
> e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
> UN  10.1.24.146  127.42 GiB  256  ?
> adf40fa3-86c4-42c3-bf0a-0f3ee1651696  us-east-1b
> UN  10.1.26.181  133.44 GiB  256  ?
> 0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
> UN  10.1.29.202  113.33 GiB  256  ?
> d260d719-eae3-48ab-8a98-ea5c7b8f6eb6  us-east-1b
> UN  10.1.31.60   183.63 GiB  256  ?
> 3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
> UN  10.1.24.175  118.09 GiB  256  ?
> bba1e80b-8156-4399-bd6a-1b5ccb47bddb  us-east-1b
> UN  10.1.29.223  137.24 GiB  256  ?
> 450fbb61-3817-419a-a4c6-4b652eb5ce01  us-east-1b
>
> Weird stuff is related to this post
> <https://lists.apache.org/thread.html/r92279215bb2e169848cc2b15d320b8a15bfcf1db2dae79d5662c97c5%40%3Cuser.cassandra.apache.org%3E>
> where I don't find a match between the load and du -sh * for the node
> 10.1.31.60 and I was trying to figure out the reason, if it was due to the
> number of vnodes.
>
> 2 Out-of-topic questions:
>
> 1)
> Does Cassandra keep a copy of the data per rack so if I need to keep the
> things balanced and I would have to add 3 racks at the time in a single
> Datacenter keep the things balanced?
>
> 2) Is it better to keep a single Rack with a single Datacenter in 3
> different availability zones with replication factor = 3 or to have for
> each Datacenter: 1 Rack and 1 Availability Zone and eventually redirect the
> client to a fallback Datacenter in case one of the availability zone is not
> reachable?
>
> Right now we are separating the Datacenter for reads from the one that
> handles the writes...
>
> Thanks for your help!
>
> Sergio
>
>
>
>
> On Sun, Feb 2, 2020 at 6:36 PM Anthony Grasso <
> anthony.gra...@gmail.com> wrote:
>
>> Hi Sergio,
>>
>> There is a misunderstanding here. My post makes no recommendation for the
>> value of num_tokens. Rather, it focuses on how to use
>> the allocate_tokens_for_keyspace setting when creating a new cluster.
>>
>> Whilst a value of 4 is used for num_tokens in the post, it was chosen for
>> demonstration purposes. Specifically it makes:
>>
>>- the uneven token distribution in a small cluster very obvious,
>>- identifying the endpoints displayed in nodetool ring easy, and
>>- the initial_token setup less verbose and easier to follow.
>>
>> I will add an editorial note to the post with the above information
>> so there is no confusion about why 4 tokens were used.
>>
>> I would only consider moving a cluster to 4 tokens if it is larger than
>> 100 nodes. If you read through the paper that Erick mentioned, written
>> by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the
>> availability of large scale clusters.
>>
>> If you are after more details about the trade-offs between different
>> sized token values, please see the discussion on the dev mailing list: 
>> "[Discuss]
>> 

Re: sstableloader: How much does it actually need?

2020-02-05 Thread Sergio
Another option is the DataStax Bulk Loader (dsbulk), but it requires
converting to CSV/JSON (a good option if you don't want to play with
sstableloader and deal with getting all the sstables from all the nodes):
https://docs.datastax.com/en/dsbulk/doc/index.html

Cheers

Sergio

On Wed, Feb 5, 2020 at 4:56 PM Erick Ramirez wrote:

> Unfortunately, there isn't a guarantee that 2 nodes alone will have the
> full copy of data. I'd rather not say "it depends". 
>
> TIP: If the nodes in the target cluster have identical tokens allocated,
> you can just do a straight copy of the sstables node-for-node then do nodetool
> refresh. If the target cluster is already built and you can't assign the
> same tokens then sstableloader is your only option. Cheers!
>
> P.S. No need to apologise for asking questions. That's what we're all here
> for. Just keep them coming. 
>
>>
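A minimal sketch of the two approaches described above (paths, keyspace/table
names and target hosts are placeholders):

# 1) target cluster has identical token assignments: copy the sstables into place
#    node-for-node, then load them without a restart
nodetool refresh my_ks my_table

# 2) otherwise stream them in with sstableloader from a host that can reach the target cluster
sstableloader -d 10.0.0.11,10.0.0.12 /backups/my_ks/my_table-<table_id>/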


Re: nodetool load does not match du

2020-02-03 Thread Sergio
From the nodetool status docs, Load is:

   The amount of file system data under the cassandra data directory after
   excluding all content in the snapshots subdirectories. Because all SSTable
   data files are included, any data that is not cleaned up (such as
   TTL-expired cells or tombstoned data) is counted.

https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/tools/toolsStatus.html



On Mon, Feb 3, 2020 at 11:43 PM Sergio wrote:

> Thanks, Erick!
>
> I thought that the snapshot size was not counted in the load.
>
On Mon, Feb 3, 2020 at 11:24 PM Erick Ramirez <
flightc...@gmail.com> wrote:
>
>> Why the df -h and du -sh shows a big discrepancy? nodetool load is it
>>> computed with df -h?
>>>
>>
>> In Linux terms, df reports the filesystem disk usage while du is an
>> *estimate* of the file space usage. What that means is that the operating
>> system uses different accounting between the two utilities. If you're
>> looking for a more detailed explanation, just do a search for "df vs du".
>>
>> With nodetool load, do you have any snapshots still on disk? This usually
>> accounts for the discrepancy. Snapshots are hard links to the same inodes
>> as the original SSTables -- put simply, they're "pointers" to the original
>> files so they don't occupy the same amount of space.
>>
>> If you think there's a real issue, one way to troubleshoot is to do a du
>> on the table subdirectory then compare it to the size reported by nodetool
>> tablestats . Cheers!
>>
>


Re: nodetool load does not match du

2020-02-03 Thread Sergio
Thanks, Erick!

I thought that the snapshot size was not counted in the load.

On Mon, Feb 3, 2020 at 11:24 PM Erick Ramirez wrote:

> Why the df -h and du -sh shows a big discrepancy? nodetool load is it
>> computed with df -h?
>>
>
> In Linux terms, df reports the filesystem disk usage while du is an
> *estimate* of the file space usage. What that means is that the operating
> system uses different accounting between the two utilities. If you're
> looking for a more detailed explanation, just do a search for "df vs du".
>
> With nodetool load, do you have any snapshots still on disk? This usually
> accounts for the discrepancy. Snapshots are hard links to the same inodes
> as the original SSTables -- put simply, they're "pointers" to the original
> files so they don't occupy the same amount of space.
>
> If you think there's a real issue, one way to troubleshoot is to do a du
> on the table subdirectory then compare it to the size reported by nodetool
> tablestats . Cheers!
>
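A minimal sketch of that per-table comparison (the data path and keyspace/table
names are placeholders):

du -sh /mnt/cassandra/data/my_ks/my_table-*/
nodetool tablestats my_ks.my_table | grep -i 'space used'
# "Space used by snapshots (total)" shows how much of the difference is snapshots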


Re: [EXTERNAL] How to reduce vnodes without downtime

2020-02-03 Thread Sergio
After reading this

*I would only consider moving a cluster to 4 tokens if it is larger than
100 nodes. If you read through the paper that Erick mentioned, written
by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the
availability of large scale clusters.*

and

With 16 tokens, that is vastly improved, but you still have up to 64 nodes
each node needs to query against, so you're again, hitting every node
unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I
wouldn't use 16 here, and I doubt any of you would either.  I've advocated
for 4 tokens because you'd have overlap with only 16 nodes, which works
well for small clusters as well as large.  Assuming I was creating a new
cluster for myself (in a hypothetical brand new application I'm building) I
would put this in production.  I have worked with several teams where I
helped them put 4 token clusters in prod and it has worked very well.  We
didn't see any wild imbalance issues.

from
https://lists.apache.org/thread.html/r55d8e68483aea30010a4162ae94e92bc63ed74d486e6c642ee66f6ae%40%3Cuser.cassandra.apache.org%3E

Sorry guys, but I am kinda confused now which should be the recommended
approach for the number of *vnodes*.
Right now I am handling a cluster with just 9 nodes and a data size of
100-200GB per node.

I am seeing some unbalancing and I was worried because I have 256 vnodes

--  Address  Load   Tokens   OwnsHost ID
Rack
UN  10.1.30.112  115.88 GiB  256  ?
e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.24.146  127.42 GiB  256  ?
adf40fa3-86c4-42c3-bf0a-0f3ee1651696  us-east-1b
UN  10.1.26.181  133.44 GiB  256  ?
0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
UN  10.1.29.202  113.33 GiB  256  ?
d260d719-eae3-48ab-8a98-ea5c7b8f6eb6  us-east-1b
UN  10.1.31.60   183.63 GiB  256  ?
3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
UN  10.1.24.175  118.09 GiB  256  ?
bba1e80b-8156-4399-bd6a-1b5ccb47bddb  us-east-1b
UN  10.1.29.223  137.24 GiB  256  ?
450fbb61-3817-419a-a4c6-4b652eb5ce01  us-east-1b

The weird part is related to this post
<https://lists.apache.org/thread.html/r92279215bb2e169848cc2b15d320b8a15bfcf1db2dae79d5662c97c5%40%3Cuser.cassandra.apache.org%3E>
where the load and du -sh * don't match for the node
10.1.31.60, and I was trying to figure out whether the reason was the
number of vnodes.

2 Out-of-topic questions:

1)
Does Cassandra keep a copy of the data per rack, so that to keep things
balanced I would have to add 3 racks at a time in a single datacenter?

2) Is it better to keep a single rack with a single datacenter spanning 3
different availability zones with replication factor = 3, or to have, for
each datacenter, 1 rack and 1 availability zone and eventually redirect the
client to a fallback datacenter in case one of the availability zones is not
reachable?

Right now we are separating the Datacenter for reads from the one that
handles the writes...

Thanks for your help!

Sergio




On Sun, Feb 2, 2020 at 6:36 PM Anthony Grasso <
anthony.gra...@gmail.com> wrote:

> Hi Sergio,
>
> There is a misunderstanding here. My post makes no recommendation for the
> value of num_tokens. Rather, it focuses on how to use
> the allocate_tokens_for_keyspace setting when creating a new cluster.
>
> Whilst a value of 4 is used for num_tokens in the post, it was chosen for
> demonstration purposes. Specifically it makes:
>
>- the uneven token distribution in a small cluster very obvious,
>- identifying the endpoints displayed in nodetool ring easy, and
>- the initial_token setup less verbose and easier to follow.
>
> I will add an editorial note to the post with the above information
> so there is no confusion about why 4 tokens were used.
>
> I would only consider moving a cluster to 4 tokens if it is larger than
> 100 nodes. If you read through the paper that Erick mentioned, written
> by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the
> availability of large scale clusters.
>
> If you are after more details about the trade-offs between different sized
> token values, please see the discussion on the dev mailing list: "[Discuss]
> num_tokens default in Cassandra 4.0
> <https://www.mail-archive.com/search?l=dev%40cassandra.apache.org=subject%3A%22%5C%5BDiscuss%5C%5D+num_tokens+default+in+Cassandra+4.0%22=oldest>
> ".
>
> Regards,
> Anthony
>
> On Sat, 1 Feb 2020 at 10:07, Sergio  wrote:
>
>>
>> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>>  This
>> is the article with 4 token recommendations.
>> @Erick Ramirez. which is the dev thread for the default 32 tokens
>> recommendation?
>>
>> Thanks,
>>> Sergio
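For reference, a minimal sketch of the settings involved when standing up a new
low-token cluster along the lines of the linked post (the keyspace name and
token count are illustrative; per the post, initial_token is set explicitly
only on the first nodes):

grep -nE 'num_tokens|allocate_tokens_for_keyspace|initial_token' /etc/cassandra/conf/cassandra.yaml
# on nodes added after the initial ones, something like:
#   num_tokens: 4
#   allocate_tokens_for_keyspace: my_keyspace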

nodetool load does not match du

2020-02-03 Thread Sergio Bilello
Hello!
I was trying to understand the below differences:
Cassandra 3.11.4
i3xlarge aws nodes

$ du -sh /mnt
123G    /mnt

$ nodetool info
ID : 3647fcca-688a-4851-ab15-df36819910f4
Gossip active  : true
Thrift active  : true
Native Transport active: true
Load   : 183.55 GiB
Generation No  : 1570757970
Uptime (seconds)   : 10041867
Heap Memory (MB)   : 3574.09 / 7664.00
Off Heap Memory (MB)   : 441.70
Data Center: live
Rack   : us-east-1b
Exceptions : 0
Key Cache  : entries 1430578, size 100 MiB, capacity 100 MiB, 
10075279019 hits, 13328775396 requests, 0.756 recent hit rate, 14400 save 
period in seconds
Row Cache  : entries 0, size 0 bytes, capacity 0 bytes, 0 hits, 0 
requests, NaN recent hit rate, 0 save period in seconds
Counter Cache  : entries 0, size 0 bytes, capacity 50 MiB, 0 hits, 0 
requests, NaN recent hit rate, 7200 save period in seconds
Chunk Cache: entries 7680, size 479.97 MiB, capacity 480 MiB, 
1835784783 misses, 11836353728 requests, 0.845 recent hit rate, 141.883 
microseconds miss latency
Percent Repaired   : 0.10752808456509523%
Token  : (invoke with -T/--tokens to see all 256 tokens)

$ df -h
Filesystem  Size  Used Avail Use% Mounted on
devtmpfs 15G 0   15G   0% /dev
tmpfs15G   72K   15G   1% /dev/shm
tmpfs15G  1.4G   14G  10% /run
tmpfs15G 0   15G   0% /sys/fs/cgroup
/dev/xvda1   50G  9.9G   41G  20% /
/dev/nvme0n1885G  181G  705G  21% /mnt
tmpfs   3.0G 0  3.0G   0% /run/user/995
tmpfs   3.0G 0  3.0G   0% /run/user/1009

Why do df -h and du -sh show such a big discrepancy? Is nodetool load computed
from df -h?
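Two things worth ruling out first, just as a guess: leftover snapshots, and
SSTables that were deleted on disk but are still held open by the process
(du walks the directory tree so it misses those, while df still counts the
space). Assuming a single Cassandra process:

nodetool listsnapshots
sudo lsof -p "$(pgrep -f CassandraDaemon)" | grep -c deleted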



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: [EXTERNAL] How to reduce vnodes without downtime

2020-02-03 Thread Sergio
Thanks Erick!

Best,

Sergio

On Sun, Feb 2, 2020, 10:07 PM Erick Ramirez  wrote:

> If you are after more details about the trade-offs between different sized
>> token values, please see the discussion on the dev mailing list: "[Discuss]
>> num_tokens default in Cassandra 4.0
>> <https://www.mail-archive.com/search?l=dev%40cassandra.apache.org=subject%3A%22%5C%5BDiscuss%5C%5D+num_tokens+default+in+Cassandra+4.0%22=oldest>
>> ".
>>
>> Regards,
>> Anthony
>>
>> On Sat, 1 Feb 2020 at 10:07, Sergio  wrote:
>>
>>>
>>> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>>>  This
>>> is the article with 4 token recommendations.
>>> @Erick Ramirez. which is the dev thread for the default 32 tokens
>>> recommendation?
>>>
>>> Thanks,
>>> Sergio
>>>
>>
> Sergio, my apologies for not replying. For some reason, your reply went to
> my spam folder and I didn't see it.
>
> Thanks, Anthony, for responding. I was indeed referring to that dev
> thread. Cheers!
>
>


Re: [EXTERNAL] How to reduce vnodes without downtime

2020-02-02 Thread Sergio
Thanks Anthony!

I will read more about it

Best,

Sergio



Il giorno dom 2 feb 2020 alle ore 18:36 Anthony Grasso <
anthony.gra...@gmail.com> ha scritto:

> Hi Sergio,
>
> There is a misunderstanding here. My post makes no recommendation for the
> value of num_tokens. Rather, it focuses on how to use
> the allocate_tokens_for_keyspace setting when creating a new cluster.
>
> Whilst a value of 4 is used for num_tokens in the post, it was chosen for
> demonstration purposes. Specifically it makes:
>
>- the uneven token distribution in a small cluster very obvious,
>- identifying the endpoints displayed in nodetool ring easy, and
>- the initial_token setup less verbose and easier to follow.
>
> I will add an editorial note to the post with the above information
> so there is no confusion about why 4 tokens were used.
>
> I would only consider moving a cluster to 4 tokens if it is larger than
> 100 nodes. If you read through the paper that Erick mentioned, written
> by Joe Lynch & Josh Snyder, they show that the num_tokens impacts the
> availability of large scale clusters.
>
> If you are after more details about the trade-offs between different sized
> token values, please see the discussion on the dev mailing list: "[Discuss]
> num_tokens default in Cassandra 4.0
> <https://www.mail-archive.com/search?l=dev%40cassandra.apache.org=subject%3A%22%5C%5BDiscuss%5C%5D+num_tokens+default+in+Cassandra+4.0%22=oldest>
> ".
>
> Regards,
> Anthony
>
> On Sat, 1 Feb 2020 at 10:07, Sergio  wrote:
>
>>
>> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>>  This
>> is the article with 4 token recommendations.
>> @Erick Ramirez. which is the dev thread for the default 32 tokens
>> recommendation?
>>
>> Thanks,
>> Sergio
>>
>> Il giorno ven 31 gen 2020 alle ore 14:49 Erick Ramirez <
>> flightc...@gmail.com> ha scritto:
>>
>>> There's an active discussion going on right now in a separate dev
>>> thread. The current "default recommendation" is 32 tokens. But there's a
>>> push for 4 in combination with allocate_tokens_for_keyspace from Jon
>>> Haddad & co (based on a paper from Joe Lynch & Josh Snyder).
>>>
>>> If you're satisfied with the results from your own testing, go with 4
>>> tokens. And that's the key -- you must test, test, TEST! Cheers!
>>>
>>> On Sat, Feb 1, 2020 at 5:17 AM Arvinder Dhillon 
>>> wrote:
>>>
>>>> What is recommended vnodes now? I read 8 in later cassandra 3.x
>>>> Is the new recommendation 4 now even in version 3.x (asking for 3.11)?
>>>> Thanks
>>>>
>>>> On Fri, Jan 31, 2020 at 9:49 AM Durity, Sean R <
>>>> sean_r_dur...@homedepot.com> wrote:
>>>>
>>>>> These are good clarifications and expansions.
>>>>>
>>>>>
>>>>>
>>>>> Sean Durity
>>>>>
>>>>>
>>>>>
>>>>> *From:* Anthony Grasso 
>>>>> *Sent:* Thursday, January 30, 2020 7:25 PM
>>>>> *To:* user 
>>>>> *Subject:* Re: [EXTERNAL] How to reduce vnodes without downtime
>>>>>
>>>>>
>>>>>
>>>>> Hi Maxim,
>>>>>
>>>>>
>>>>>
>>>>> Basically what Sean suggested is the way to do this without downtime.
>>>>>
>>>>>
>>>>>
>>>>> To clarify the, the *three* steps following the "Decommission each
>>>>> node in the DC you are working on" step should be applied to *only*
>>>>> the decommissioned nodes. So where it say "*all nodes*" or "*every
>>>>> node*" it applies to only the decommissioned nodes.
>>>>>
>>>>>
>>>>>
>>>>> In addition, the step that says "Wipe data on all the nodes", I would
>>>>> delete all files in the following directories on the decommissioned nodes.
>>>>>
>>>>>- data (usually located in /var/lib/cassandra/data)
>>>>>- commitlogs (usually located in /var/lib/cassandra/commitlogs)
>>>>>- hints (usually located in /var/lib/casandra/hints)
>>>>>- saved_caches (usually located in /var/lib/cassandra/saved_caches)
>>>>>
>>>>>
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Anthony
>>>>>
>>>

Re: [EXTERNAL] How to reduce vnodes without downtime

2020-01-31 Thread Sergio
https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
This
is the article with 4 token recommendations.
@Erick Ramirez. which is the dev thread for the default 32 tokens
recommendation?

Thanks,
Sergio

Il giorno ven 31 gen 2020 alle ore 14:49 Erick Ramirez 
ha scritto:

> There's an active discussion going on right now in a separate dev thread.
> The current "default recommendation" is 32 tokens. But there's a push for 4
> in combination with allocate_tokens_for_keyspace from Jon Haddad & co
> (based on a paper from Joe Lynch & Josh Snyder).
>
> If you're satisfied with the results from your own testing, go with 4
> tokens. And that's the key -- you must test, test, TEST! Cheers!
>
> On Sat, Feb 1, 2020 at 5:17 AM Arvinder Dhillon 
> wrote:
>
>> What is recommended vnodes now? I read 8 in later cassandra 3.x
>> Is the new recommendation 4 now even in version 3.x (asking for 3.11)?
>> Thanks
>>
>> On Fri, Jan 31, 2020 at 9:49 AM Durity, Sean R <
>> sean_r_dur...@homedepot.com> wrote:
>>
>>> These are good clarifications and expansions.
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Anthony Grasso 
>>> *Sent:* Thursday, January 30, 2020 7:25 PM
>>> *To:* user 
>>> *Subject:* Re: [EXTERNAL] How to reduce vnodes without downtime
>>>
>>>
>>>
>>> Hi Maxim,
>>>
>>>
>>>
>>> Basically what Sean suggested is the way to do this without downtime.
>>>
>>>
>>>
>>> To clarify the, the *three* steps following the "Decommission each node
>>> in the DC you are working on" step should be applied to *only* the
>>> decommissioned nodes. So where it say "*all nodes*" or "*every node*"
>>> it applies to only the decommissioned nodes.
>>>
>>>
>>>
>>> In addition, the step that says "Wipe data on all the nodes", I would
>>> delete all files in the following directories on the decommissioned nodes.
>>>
>>>- data (usually located in /var/lib/cassandra/data)
>>>- commitlogs (usually located in /var/lib/cassandra/commitlogs)
>>>- hints (usually located in /var/lib/casandra/hints)
>>>- saved_caches (usually located in /var/lib/cassandra/saved_caches)
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Anthony
>>>
>>>
>>>
>>> On Fri, 31 Jan 2020 at 03:05, Durity, Sean R <
>>> sean_r_dur...@homedepot.com> wrote:
>>>
>>> Your procedure won’t work very well. On the first node, if you switched
>>> to 4, you would end up with only a tiny fraction of the data (because the
>>> other nodes would still be at 256). I updated a large cluster (over 150
>>> nodes – 2 DCs) to smaller number of vnodes. The basic outline was this:
>>>
>>>
>>>
>>>- Stop all repairs
>>>- Make sure the app is running against one DC only
>>>- Change the replication settings on keyspaces to use only 1 DC
>>>(basically cutting off the other DC)
>>>- Decommission each node in the DC you are working on. Because the
>>>replication setting are changed, no streaming occurs. But it releases the
>>>token assignments
>>>- Wipe data on all the nodes
>>>- Update configuration on every node to your new settings, including
>>>auto_bootstrap = false
>>>- Start all nodes. They will choose tokens, but not stream any data
>>>- Update replication factor for all keyspaces to include the new DC
>>>- I disabled binary on those nodes to prevent app connections
>>>- Run nodetool reduild with -dc (other DC) on as many nodes as your
>>>system can safely handle until they are all rebuilt.
>>>- Re-enable binary (and app connections to the rebuilt DC)
>>>- Turn on repairs
>>>- Rest for a bit, then reverse the process for the remaining DCs
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> Sean Durity – Staff Systems Engineer, Cassandra
>>>
>>>
>>>
>>> *From:* Maxim Parkachov 
>>> *Sent:* Thursday, January 30, 2020 10:05 AM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* [EXTERNAL] How to reduce vnodes without downtime
>>>
>>>
>>>
>>> Hi everyone,
>>>
>>>
>>>
>>> with discussion about reduc

Re: sstableloader & num_tokens change

2020-01-24 Thread Sergio
https://docs.datastax.com/en/dsbulk/doc/dsbulk/reference/dsbulkLoad.html

Just skimming through the docs, I see examples of loading from CSV / JSON.

Maybe there is some other command or doc page that I am missing.
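From what I can tell, the closest thing to a cluster-to-cluster copy is a
two-step unload/load through a local directory; the hosts, keyspace, table
and path below are just placeholders:

dsbulk unload -h '<source_contact_point>' -k my_ks -t my_table -url /tmp/my_table_export
dsbulk load -h '<target_contact_point>' -k my_ks -t my_table -url /tmp/my_table_export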




On Fri, Jan 24, 2020, 9:10 AM Nitan Kainth  wrote:

> Dsbulk works same as sstableloder.
>
>
> Regards,
>
> Nitan
>
> Cell: 510 449 9629
>
> On Jan 24, 2020, at 10:40 AM, Sergio  wrote:
>
> 
> I was wondering if that improvement for token allocation would work even
> with just one rack. It should but I am not sure.
>
> Does Dsbulk support migration cluster to cluster without CSV or JSON
> export?
>
> Thanks and Regards
>
> On Fri, Jan 24, 2020, 8:34 AM Nitan Kainth  wrote:
>
>> Instead of sstableloader consider dsbulk by datastax.
>>
>> On Fri, Jan 24, 2020 at 10:20 AM Reid Pinchback <
>> rpinchb...@tripadvisor.com> wrote:
>>
>>> Jon Haddad has previously made the case for num_tokens=4.  His
>>> Accelerate 2019 talk is available at:
>>>
>>>
>>>
>>> https://www.youtube.com/watch?v=swL7bCnolkU
>>>
>>>
>>>
>>> You might want to check that out.  Also I think the amount of effort you
>>> put into evening out the token distribution increases as vnode count
>>> shrinks.  The caveats are explored at:
>>>
>>>
>>>
>>>
>>> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>>>
>>>
>>>
>>>
>>>
>>> *From: *Voytek Jarnot 
>>> *Reply-To: *"user@cassandra.apache.org" 
>>> *Date: *Friday, January 24, 2020 at 10:39 AM
>>> *To: *"user@cassandra.apache.org" 
>>> *Subject: *sstableloader & num_tokens change
>>>
>>>
>>>
>>> *Message from External Sender*
>>>
>>> Running 3.11.x, 4 nodes RF=3, default 256 tokens; moving to a different
>>> 4 node RF=3 cluster.
>>>
>>>
>>>
>>> I've read that 256 is not an optimal default num_tokens value, and that
>>> 32 is likely a better option.
>>>
>>>
>>>
>>> We have the "opportunity" to switch, as we're migrating environments and
>>> will likely be using sstableloader to do so. I'm curious if there are any
>>> gotchas with using sstableloader to restore snapshots taken from 256-token
>>> nodes into a cluster with 32-token nodes (otherwise same # of nodes and
>>> same RF).
>>>
>>>
>>>
>>> Thanks in advance.
>>>
>>


Re: sstableloader & num_tokens change

2020-01-24 Thread Sergio
I was wondering if that improvement for token allocation would work even
with just one rack. It should but I am not sure.

Does Dsbulk support migration cluster to cluster without CSV or JSON export?

Thanks and Regards

On Fri, Jan 24, 2020, 8:34 AM Nitan Kainth  wrote:

> Instead of sstableloader consider dsbulk by datastax.
>
> On Fri, Jan 24, 2020 at 10:20 AM Reid Pinchback <
> rpinchb...@tripadvisor.com> wrote:
>
>> Jon Haddad has previously made the case for num_tokens=4.  His Accelerate
>> 2019 talk is available at:
>>
>>
>>
>> https://www.youtube.com/watch?v=swL7bCnolkU
>>
>>
>>
>> You might want to check that out.  Also I think the amount of effort you
>> put into evening out the token distribution increases as vnode count
>> shrinks.  The caveats are explored at:
>>
>>
>>
>>
>> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>>
>>
>>
>>
>>
>> *From: *Voytek Jarnot 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Friday, January 24, 2020 at 10:39 AM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *sstableloader & num_tokens change
>>
>>
>>
>> *Message from External Sender*
>>
>> Running 3.11.x, 4 nodes RF=3, default 256 tokens; moving to a different 4
>> node RF=3 cluster.
>>
>>
>>
>> I've read that 256 is not an optimal default num_tokens value, and that
>> 32 is likely a better option.
>>
>>
>>
>> We have the "opportunity" to switch, as we're migrating environments and
>> will likely be using sstableloader to do so. I'm curious if there are any
>> gotchas with using sstableloader to restore snapshots taken from 256-token
>> nodes into a cluster with 32-token nodes (otherwise same # of nodes and
>> same RF).
>>
>>
>>
>> Thanks in advance.
>>
>


Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Sergio
Thanks for the explanation. It should deserve a blog post

Sergio

On Wed, Jan 22, 2020, 1:22 PM Reid Pinchback 
wrote:

> The reaper logs will say if nodes are being skipped.  The web UI isn’t
> that good at making it apparent.  You can sometimes tell it is likely
> happening when you see time gaps between parts of the repair.  This is for
> when nodes are skipped because of a timeout, but not only that.  The gaps
> are mostly controlled by the combined results of segmentCountPerNode,
> repairIntensity, and hangingRepairTimeoutMins.  The last of those three is
> the most obvious influence on timeouts, but the other two have some impact
> on the work attempted and the size of the time gaps.  However the C*
> version also has some bearing, as it influences how hard it is to process
> the data needed for repairs.
>
>
>
> The more subtle aspect of node skipping isn’t the hanging repairs.  When
> repair of a token range is first attempted, Reaper uses JMX to ask C* if a
> repair is already underway.  The way it asks is very simplistic, so it
> doesn’t mean a repair is underway for that particular token range.  It just
> means something looking like a repair is going on.  Basically it just asks
> “hey is there a thread with the right magic naming pattern?”  The problem I
> think is that when you get some repair activity triggered on reads and
> writes for inconsistent data, I believe they show up as these kinds of
> threads too.  If you have a bad usage pattern of C* (where you write then
> very soon read back) then logically you’d expect this to happen quite a lot.
>
>
>
> I’m not an expert on the internals since I’m not one of the C*
> contributors, but having stared at that part of the source quite a bit this
> year, that’s my take on what can happen.  And if I’m correct, that’s not a
> thing you can tune for. It is a consequence of C*-unfriendly usage patterns.
>
>
>
> Bottom line though is that tuning repairs is only something you do if you
> find that repairs are taking longer than makes sense to you.  It’s totally
> separate from the notion that you should be able to run reaper-controlled
> repairs at least 2x per gc grace seconds.  That’s just a case of making
> some observations on the arithmetic of time intervals.
>
>
>
>
>
> *From: *Sergio 
> *Date: *Wednesday, January 22, 2020 at 4:08 PM
> *To: *Reid Pinchback 
> *Cc: *"user@cassandra.apache.org" 
> *Subject: *Re: Is there any concern about increasing gc_grace_seconds
> from 5 days to 8 days?
>
>
>
> *Message from External Sender*
>
> Thank you very much for your extended response.
>
> Should I look in the log some particular message to detect such behavior?
>
> How do you tune it ?
>
>
>
> Thanks,
>
>
>
> Sergio
>
>
>
> On Wed, Jan 22, 2020, 12:59 PM Reid Pinchback 
> wrote:
>
> Kinda. It isn’t that you have to repair twice per se, just that the
> possibility of running repairs at least twice before GC grace seconds
> elapse means that clearly there is no chance of a tombstone not being
> subject to repair at least once before you hit your GC grace seconds.
>
>
>
> Imagine a tombstone being created on the very first node that Reaper
> looked at in a repair cycle, but one second after Reaper completed repair
> of that particular token range.  Repairs will be complete, but that
> particular tombstone just missed being part of the effort.
>
>
>
> Now your next repair run happens.  What if Reaper doesn’t look at that
> same node first?  It is easy to have happen, as there is a bunch of logic
> related to detection of existing repairs or things taking too long.  So the
> box that was “the first node” in that first repair run, through bad luck
> gets kicked down to later in the second run.  I’ve seen nodes get skipped
> multiple times (you can tune to reduce that, but bottom line… it happens).
> So, bad luck you’ve got.  Eventually the node does get repaired, and the
> aging tombstone finally gets removed.  All fine and dandy…
>
>
>
> Provided that the second repair run got to that point BEFORE you hit your
> GC grace seconds.
>
>
>
> That’s why you need enough time to run it twice.  Because you need enough
> time to catch the oldest possible tombstone, even if it is dealt with at
> the very end of a repair run.  Yes, it sounds like a bit of a degenerate
> case, but if you are writing a lot of data, the probability of not having
> the degenerate cases become real cases becomes vanishingly small.
>
>
>
> R
>
>
>
>
>
> *From: *Sergio 
> *Date: *Wednesday, January 22, 2020 at 1:41 PM
> *To: *"user@cassandra.apache.org" , Reid
> Pinchback 
> *Subject: *Re: Is there any concern 

Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Sergio
Thank you very much for your extended response.
Should I look in the log some particular message to detect such behavior?
How do you tune it ?

Thanks,

Sergio

On Wed, Jan 22, 2020, 12:59 PM Reid Pinchback 
wrote:

> Kinda. It isn’t that you have to repair twice per se, just that the
> possibility of running repairs at least twice before GC grace seconds
> elapse means that clearly there is no chance of a tombstone not being
> subject to repair at least once before you hit your GC grace seconds.
>
>
>
> Imagine a tombstone being created on the very first node that Reaper
> looked at in a repair cycle, but one second after Reaper completed repair
> of that particular token range.  Repairs will be complete, but that
> particular tombstone just missed being part of the effort.
>
>
>
> Now your next repair run happens.  What if Reaper doesn’t look at that
> same node first?  It is easy to have happen, as there is a bunch of logic
> related to detection of existing repairs or things taking too long.  So the
> box that was “the first node” in that first repair run, through bad luck
> gets kicked down to later in the second run.  I’ve seen nodes get skipped
> multiple times (you can tune to reduce that, but bottom line… it happens).
> So, bad luck you’ve got.  Eventually the node does get repaired, and the
> aging tombstone finally gets removed.  All fine and dandy…
>
>
>
> Provided that the second repair run got to that point BEFORE you hit your
> GC grace seconds.
>
>
>
> That’s why you need enough time to run it twice.  Because you need enough
> time to catch the oldest possible tombstone, even if it is dealt with at
> the very end of a repair run.  Yes, it sounds like a bit of a degenerate
> case, but if you are writing a lot of data, the probability of not having
> the degenerate cases become real cases becomes vanishingly small.
>
>
>
> R
>
>
>
>
>
> *From: *Sergio 
> *Date: *Wednesday, January 22, 2020 at 1:41 PM
> *To: *"user@cassandra.apache.org" , Reid
> Pinchback 
> *Subject: *Re: Is there any concern about increasing gc_grace_seconds
> from 5 days to 8 days?
>
>
>
> *Message from External Sender*
>
> I was wondering if I should always complete 2 repairs cycles with reaper
> even if one repair cycle finishes in 7 hours.
>
> Currently, I have around 200GB in column family data size to be repaired
> and I was scheduling once repair a week and I was not having too much
> stress on my 8 nodes cluster with i3xlarge nodes.
>
> Thanks,
>
> Sergio
>
>
>
> Il giorno mer 22 gen 2020 alle ore 08:28 Sergio 
> ha scritto:
>
> Thank you very much! Yes I am using reaper!
>
>
>
> Best,
>
>
>
> Sergio
>
>
>
> On Wed, Jan 22, 2020, 8:00 AM Reid Pinchback 
> wrote:
>
> Sergio, if you’re looking for a new frequency for your repairs because of
> the change, if you are using reaper, then I’d go for repair_freq <=
> gc_grace / 2.
>
>
>
> Just serendipity with a conversation I was having at work this morning.
> When you actually watch the reaper logs then you can see situations where
> unlucky timing with skipped nodes can make the time to remove a tombstone
> be up to 2 x repair_run_time.
>
>
>
> If you aren’t using reaper, your mileage will vary, particularly if your
> repairs are consistent in the ordering across nodes.  Reaper can be
> moderately non-deterministic hence the need to be sure you can complete at
> least two repair runs.
>
>
>
> R
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Tuesday, January 21, 2020 at 7:13 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Is there any concern about increasing gc_grace_seconds
> from 5 days to 8 days?
>
>
>
> *Message from External Sender*
>
> Thank you very much for your response.
>
> The considerations mentioned are the ones that I was expecting.
>
> I believe that I am good to go.
>
> I just wanted to make sure that there was no need to run any other extra
> command beside that one.
>
>
>
> Best,
>
>
>
> Sergio
>
>
>
> On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa  wrote:
>
> Note that if you're actually running repairs within 5 days, and you adjust
> this to 8, you may stream a bunch of tombstones across in that 5-8 day
> window, which can increase disk usage / compaction (because as you pass 5
> days, one replica may gc away the tombstones, the others may not because
> the tombstones shadow data, so you'll re-stream the tombstone to the other
> replicas)
>
>
>
> On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims 
> wrote:
>
> In 

Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Sergio
I was wondering if I should always complete 2 repair cycles with Reaper,
even if one repair cycle finishes in 7 hours.

Currently, I have around 200GB of column family data to be repaired, and I
was scheduling one repair a week, which was not putting too much stress on my
8-node cluster of i3.xlarge nodes.
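Doing the arithmetic on the 8-day value discussed in this thread (691200
seconds), one full repair cycle every 4 days or less would satisfy the
repair_freq <= gc_grace / 2 rule quoted below:

# gc_grace_seconds / 2, expressed in days
echo $(( 691200 / 2 / 86400 ))   # -> 4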

Thanks,

Sergio

Il giorno mer 22 gen 2020 alle ore 08:28 Sergio 
ha scritto:

> Thank you very much! Yes I am using reaper!
>
> Best,
>
> Sergio
>
> On Wed, Jan 22, 2020, 8:00 AM Reid Pinchback 
> wrote:
>
>> Sergio, if you’re looking for a new frequency for your repairs because of
>> the change, if you are using reaper, then I’d go for repair_freq <=
>> gc_grace / 2.
>>
>>
>>
>> Just serendipity with a conversation I was having at work this morning.
>> When you actually watch the reaper logs then you can see situations where
>> unlucky timing with skipped nodes can make the time to remove a tombstone
>> be up to 2 x repair_run_time.
>>
>>
>>
>> If you aren’t using reaper, your mileage will vary, particularly if your
>> repairs are consistent in the ordering across nodes.  Reaper can be
>> moderately non-deterministic hence the need to be sure you can complete at
>> least two repair runs.
>>
>>
>>
>> R
>>
>>
>>
>> *From: *Sergio 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Tuesday, January 21, 2020 at 7:13 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: Is there any concern about increasing gc_grace_seconds
>> from 5 days to 8 days?
>>
>>
>>
>> *Message from External Sender*
>>
>> Thank you very much for your response.
>>
>> The considerations mentioned are the ones that I was expecting.
>>
>> I believe that I am good to go.
>>
>> I just wanted to make sure that there was no need to run any other extra
>> command beside that one.
>>
>>
>>
>> Best,
>>
>>
>>
>> Sergio
>>
>>
>>
>> On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa  wrote:
>>
>> Note that if you're actually running repairs within 5 days, and you
>> adjust this to 8, you may stream a bunch of tombstones across in that 5-8
>> day window, which can increase disk usage / compaction (because as you pass
>> 5 days, one replica may gc away the tombstones, the others may not because
>> the tombstones shadow data, so you'll re-stream the tombstone to the other
>> replicas)
>>
>>
>>
>> On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims 
>> wrote:
>>
>> In addition to extra space, queries can potentially be more expensive
>> because more dead rows and tombstones will need to be scanned.  How much of
>> a difference this makes will depend drastically on the schema and access
>> pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.
>>
>>
>>
>> On Tue, Jan 21, 2020 at 2:14 PM Sergio  wrote:
>>
>> https://stackoverflow.com/a/22030790
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_a_22030790=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=qt1NAYTks84VVQ4WGXWkK6pw85m3FcuUjPRJPdIHMdw=aEgz5F5HRxPT3w4hpfNXQRhcchwRjrpf7KB3QyywO_Q=>
>>
>>
>>
>> For CQLSH
>>
>> alter table <table_name> with GC_GRACE_SECONDS = <seconds>;
>>
>>
>>
>>
>>
>> Il giorno mar 21 gen 2020 alle ore 13:12 Sergio <
>> lapostadiser...@gmail.com> ha scritto:
>>
>> Hi guys!
>>
>> I just wanted to confirm with you before doing such an operation. I
>> expect to increase the space but nothing more than this. I  need to perform
>> just :
>>
>> UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
>>
>> Is it correct?
>>
>> Thanks,
>>
>> Sergio
>>
>>


Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Sergio
Thank you very much! Yes I am using reaper!

Best,

Sergio

On Wed, Jan 22, 2020, 8:00 AM Reid Pinchback 
wrote:

> Sergio, if you’re looking for a new frequency for your repairs because of
> the change, if you are using reaper, then I’d go for repair_freq <=
> gc_grace / 2.
>
>
>
> Just serendipity with a conversation I was having at work this morning.
> When you actually watch the reaper logs then you can see situations where
> unlucky timing with skipped nodes can make the time to remove a tombstone
> be up to 2 x repair_run_time.
>
>
>
> If you aren’t using reaper, your mileage will vary, particularly if your
> repairs are consistent in the ordering across nodes.  Reaper can be
> moderately non-deterministic hence the need to be sure you can complete at
> least two repair runs.
>
>
>
> R
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Tuesday, January 21, 2020 at 7:13 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Is there any concern about increasing gc_grace_seconds
> from 5 days to 8 days?
>
>
>
> *Message from External Sender*
>
> Thank you very much for your response.
>
> The considerations mentioned are the ones that I was expecting.
>
> I believe that I am good to go.
>
> I just wanted to make sure that there was no need to run any other extra
> command beside that one.
>
>
>
> Best,
>
>
>
> Sergio
>
>
>
> On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa  wrote:
>
> Note that if you're actually running repairs within 5 days, and you adjust
> this to 8, you may stream a bunch of tombstones across in that 5-8 day
> window, which can increase disk usage / compaction (because as you pass 5
> days, one replica may gc away the tombstones, the others may not because
> the tombstones shadow data, so you'll re-stream the tombstone to the other
> replicas)
>
>
>
> On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims 
> wrote:
>
> In addition to extra space, queries can potentially be more expensive
> because more dead rows and tombstones will need to be scanned.  How much of
> a difference this makes will depend drastically on the schema and access
> pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.
>
>
>
> On Tue, Jan 21, 2020 at 2:14 PM Sergio  wrote:
>
> https://stackoverflow.com/a/22030790
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__stackoverflow.com_a_22030790=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=qt1NAYTks84VVQ4WGXWkK6pw85m3FcuUjPRJPdIHMdw=aEgz5F5HRxPT3w4hpfNXQRhcchwRjrpf7KB3QyywO_Q=>
>
>
>
> For CQLSH
>
> alter table <table_name> with GC_GRACE_SECONDS = <seconds>;
>
>
>
>
>
> Il giorno mar 21 gen 2020 alle ore 13:12 Sergio 
> ha scritto:
>
> Hi guys!
>
> I just wanted to confirm with you before doing such an operation. I expect
> to increase the space but nothing more than this. I  need to perform just :
>
> UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
>
> Is it correct?
>
> Thanks,
>
> Sergio
>
>


Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-21 Thread Sergio
Thank you very much for your response.
The considerations mentioned are the ones that I was expecting.
I believe that I am good to go.
I just wanted to make sure that there was no need to run any other extra
command beside that one.

Best,

Sergio

On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa  wrote:

> Note that if you're actually running repairs within 5 days, and you adjust
> this to 8, you may stream a bunch of tombstones across in that 5-8 day
> window, which can increase disk usage / compaction (because as you pass 5
> days, one replica may gc away the tombstones, the others may not because
> the tombstones shadow data, so you'll re-stream the tombstone to the other
> replicas)
>
> On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims 
> wrote:
>
>> In addition to extra space, queries can potentially be more expensive
>> because more dead rows and tombstones will need to be scanned.  How much of
>> a difference this makes will depend drastically on the schema and access
>> pattern, but I wouldn't expect going from 5 days to 8 to be very noticeable.
>>
>> On Tue, Jan 21, 2020 at 2:14 PM Sergio  wrote:
>>
>>> https://stackoverflow.com/a/22030790
>>>
>>>
>>> For CQLSH
>>>
>>> alter table <table_name> with GC_GRACE_SECONDS = <seconds>;
>>>
>>>
>>>
>>> Il giorno mar 21 gen 2020 alle ore 13:12 Sergio <
>>> lapostadiser...@gmail.com> ha scritto:
>>>
>>>> Hi guys!
>>>>
>>>> I just wanted to confirm with you before doing such an operation. I
>>>> expect to increase the space but nothing more than this. I  need to perform
>>>> just :
>>>>
>>>> UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
>>>>
>>>> Is it correct?
>>>>
>>>> Thanks,
>>>>
>>>> Sergio
>>>>
>>>


Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-21 Thread Sergio
https://stackoverflow.com/a/22030790


For CQLSH

alter table <table_name> with GC_GRACE_SECONDS = <seconds>;
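Concretely, for the 8-day value in question (the table name is just an
example, and the value is plain digits with no comma):

cqlsh -e "ALTER TABLE my_keyspace.my_cf WITH gc_grace_seconds = 691200;"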



Il giorno mar 21 gen 2020 alle ore 13:12 Sergio 
ha scritto:

> Hi guys!
>
> I just wanted to confirm with you before doing such an operation. I expect
> to increase the space but nothing more than this. I  need to perform just :
>
> UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
>
> Is it correct?
>
> Thanks,
>
> Sergio
>


Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-21 Thread Sergio
Hi guys!

I just wanted to confirm with you before doing such an operation. I expect
to increase the space but nothing more than this. I  need to perform just :

UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days

Is it correct?

Thanks,

Sergio


Re: [EXTERNAL] Re: *URGENT* Migration across different Cassandra cluster few having same keyspace/table names

2020-01-17 Thread Sergio
Hi everyone,

Is the DSE BulkLoader faster than the sstableloader?

Sometimes I need to take a cluster snapshot and replicate a Cluster A to a
Cluster B that has less performance capacity but the same data size.

In terms of speed, the sstableloader should be faster, correct?

Maybe the DSE BulkLoader is more useful when you want a slice of the
data and not the entire cake. Is that correct?

Thanks,

Sergio


JMX Metrics [Cassandra-Stress-Tool VS JMX]

2019-11-22 Thread Sergio Bilello
Hi everyone!

Which function has to be used with each JMX Metric Type?

https://cassandra.apache.org/doc/latest/operating/metrics.html

https://www.datadoghq.com/blog/how-to-collect-cassandra-metrics/

For example: to compute the read latency I took the ratio of the
ReadTotalLatency_Count JMX Counter to the ReadLatency_Count JMX Timer, and the
number corresponds to the one exposed via nodetool tablestats
<keyspace>.<table>.

How should I consider the attributes: 95thPercentile, Mean etc... from 
ReadLatency bean 
org.apache.cassandra.metrics:type=Table,keyspace=test,scope=test_column_family,name=ReadLatency
I also found an open-source Grafana dashboard
https://grafana.com/grafana/dashboards/5408 but I am not convinced by the
displayed numbers when I compare them with the numbers shown by the
cassandra-stress tool.

If I want the QPS does it make sense to use rate(WriteLatencyCount[5m]) in 
grafana ?

The latency computed by the cassandra-stress tool should almost match the
latency shown by the JMX metrics, shouldn't it?

Which ones do you monitor: ClientRequest metrics, Table metrics, or
ColumnFamily metrics?

I am going to create my Grafana dashboard and explain how I configured it.

Best,

Sergio

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Find large partition https://github.com/tolbertam/sstable-tools

2019-11-22 Thread Sergio Bilello
Thanks! I will look into it

On 2019/11/22 19:22:15, Jeff Jirsa  wrote: 
> Brian Gallew has a very simple script that does something similar:
> https://github.com/BrianGallew/cassandra_tools/blob/master/poison_pill_tester
> 
> You can also search the logs for messages about writing large partitions
> during compaction.
> 
> 
> 
> 
> 
> On Thu, Nov 21, 2019 at 6:33 PM Sergio Bilello 
> wrote:
> 
> > Hi guys!
> > Just for curiosity do you know anything beside
> > https://github.com/tolbertam/sstable-tools to find a large partition?
> > Best,
> >
> > Sergio
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> 
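For the log-grepping approach Jeff mentions, something along these lines
should surface them on 3.11 (default log location; the warning is driven by
compaction_large_partition_warning_threshold_mb in cassandra.yaml):

grep -i "writing large partition" /var/log/cassandra/system.log*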

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Find large partition https://github.com/tolbertam/sstable-tools

2019-11-21 Thread Sergio Bilello
Hi guys!
Just out of curiosity, do you know of anything besides
https://github.com/tolbertam/sstable-tools to find a large partition?
Best,

Sergio

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-11-01 Thread Sergio
Hi Reid,

Thank you for your extensive response. I don't think that we have such a
person and in any case, even if I am a Software Engineer I would be curious
to deep dive into the problem and understand the reason. The only
observation that I have right now is that I have in the same cluster 2
keyspaces and 3 datacenters.
Only the Cassandra nodes that serve a particular Datacenter and Keyspace
have thousands of TCP connections established, and I see these connections
only from some clients.
We have 2 kinds of clients, built with 2 different approaches: one with
Spring Cassandra Reactive and the other with the plain Java Cassandra driver
without any wrapper.
I don't know a lot about the latter one since I didn't write that code.
One note I do want to share: I asked to add LatencyAwarePolicy to the Java
Cassandra driver, and this tremendously decreased the CPU load for any new
Cassandra node joining the cluster. I am thinking that there could be some
driver configuration that is not correct.
I will verify my theory and share the results later on for the interested
reader, or maybe to help someone who runs into the same bizarre behavior.
However, even with thousands of connections open, the load is below 3 on a
4-CPU machine and the latency is good.


Thanks and have a great weekend
Sergio




Il giorno ven 1 nov 2019 alle ore 07:56 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> Hi Sergio,
>
>
>
> I’m definitely not enough of a network wonk to make definitive statements
> on network configuration, finding your in-company network expert is
> definitely going to be a lot more productive.  I’ve forgotten if you are
> on-prem or in AWS, so if in AWS replace “your network wonk” with “your AWS
> support contact” if you’re paying for support.  I will make two more
> concrete observations though, and you can run these notions down as
> appropriate.
>
>
>
> When C* starts up, see if the logs contain a warning about jemalloc not
> being detected.  That’s something we missed in our 3.11.4 setup and is on
> my todo list to circle back around to evaluate later.  JVMs have some
> rather complicated memory management that relates to efficient allocation
> of memory to threads (this isn’t strictly a JVM thing, but JVMs definitely
> care).  If you have high connection counts, I can see that likely mattering
> to you.  Also, as part of that, the memory arena setting of 4 that is
> Cassandra’s default may not be the right one for you.  The more concurrency
> you have, the more that number may need to bump up to avoid contention on
> memory allocations.  We haven’t played with it because our simultaneous
> connection counts are modest.  Note that Cassandra can create a lot of
> threads but many of them have low activity so I think it’s more about how
> many area actually active.  Large connection counts will move the needle up
> on you and may motivate tuning the arena count.
>
>
>
> When talking to your network person, I’d see what they think about C*’s
> defaults on TCP_NODELAY vs delayed ACKs.  The Datastax docs say that the
> TCP_NODELAY default setting is false in C*, but I looked in the 3.11.4
> source and the default is coded as true.  It’s only via the config file
> samples that bounce around that it typically gets set to false.  There are
> times where Nagle and delayed ACKs don’t play well together and induce
> stalls.  I’m not the person to help you investigate that because it gets a
> bit gnarly on the details (for example, a refinement to the Nagle algorithm
> was proposed in the 1990’s that exists in some OS’s and can make my
> comments here moot).  Somebody who lives this stuff will be a more
> definitive source, but you are welcome to copy-paste my thoughts to them
> for context.
>
>
>
> R
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Wednesday, October 30, 2019 at 5:56 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Cassandra 3.11.4 Node the load starts to increase after
> few minutes to 40 on 4 CPU machine
>
>
>
> *Message from External Sender*
>
> Hi Reid,
>
> I don't have anymore this loading problem.
> I solved by changing the Cassandra Driver Configuration.
> Now my cluster is pretty stable and I don't have machines with crazy CPU
> Load.
> The only thing not urgent but I need to investigate is the number of
> ESTABLISHED TCP connections. I see just one node having 7K TCP connections
> ESTABLISHED while the others are having around 4-6K connection opened. So
> the newest nodes added into the cluster have a higher number of ESTABLISHED
> TCP connections.
>
> default['cassandra']['sysctl'] = {
> 'net.ipv4.tcp_keepalive_time' => 60,
> 'net.ipv4.tcp_keepalive_probes' => 3

***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-01 Thread Sergio
Hi Ben,

Well, I had a similar question and Jon Haddad was preferring ParNew + CMS
over G1GC for java 8.
https://lists.apache.org/thread.html/283547619b1dcdcddb80947a45e2178158394e317f3092b8959ba879@%3Cuser.cassandra.apache.org%3E
It depends on your JVM and in any case, I would test it based on your
workload.

What's your experience of running Cassandra in k8s. Are you using the
Cassandra Kubernetes Operator?

How do you monitor it and how do you perform disaster recovery backup?


Best,

Sergio

Il giorno ven 1 nov 2019 alle ore 14:14 Ben Mills  ha
scritto:

> Thanks Sergio - that's good advice and we have this built into the plan.
> Have you heard a solid/consistent recommendation/requirement as to the
> amount of memory heap requires for G1GC?
>
> On Fri, Nov 1, 2019 at 5:11 PM Sergio  wrote:
>
>> In any case I would test with tlp-stress or Cassandra stress tool any
>> configuration
>>
>> Sergio
>>
>> On Fri, Nov 1, 2019, 12:31 PM Ben Mills  wrote:
>>
>>> Greetings,
>>>
>>> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a
>>> change to the GC config.
>>>
>>> What is the minimum amount of memory that needs to be allocated to heap
>>> space when using G1GC?
>>>
>>> For GC, we currently use CMS. Along with the version upgrade, we'll be
>>> running the stateful set of Cassandra pods on new machine types in a new
>>> node pool with 12Gi memory per node. Not a lot of memory but an
>>> improvement. We may be able to go up to 16Gi memory per node. We'd like to
>>> continue using these heap settings:
>>>
>>> -XX:+UnlockExperimentalVMOptions
>>> -XX:+UseCGroupMemoryLimitForHeap
>>> -XX:MaxRAMFraction=2
>>>
>>> which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of
>>> total available).
>>>
>>> Here are some details on the environment and configs in the event that
>>> something is relevant.
>>>
>>> Environment: Kubernetes
>>> Environment Config: Stateful set of 3 replicas
>>> Storage: Persistent Volumes
>>> Storage Class: SSD
>>> Node OS: Container-Optimized OS
>>> Container OS: Ubuntu 16.04.3 LTS
>>> Data Centers: 1
>>> Racks: 3 (one per zone)
>>> Nodes: 3
>>> Tokens: 4
>>> Replication Factor: 3
>>> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
>>> Compaction Strategy: STCS (all tables)
>>> Read/Write Requirements: Blend of both
>>> Data Load: <1GB per node
>>> gc_grace_seconds: default (10 days - all tables)
>>>
>>> GC Settings: (CMS)
>>>
>>> -XX:+UseParNewGC
>>> -XX:+UseConcMarkSweepGC
>>> -XX:+CMSParallelRemarkEnabled
>>> -XX:SurvivorRatio=8
>>> -XX:MaxTenuringThreshold=1
>>> -XX:CMSInitiatingOccupancyFraction=75
>>> -XX:+UseCMSInitiatingOccupancyOnly
>>> -XX:CMSWaitDuration=3
>>> -XX:+CMSParallelInitialMarkEnabled
>>> -XX:+CMSEdenChunksRecordAlways
>>>
>>> Any ideas are much appreciated.
>>>
>>


Re: Memory Recommendations for G1GC

2019-11-01 Thread Sergio
In any case I would test any configuration with tlp-stress or the Cassandra
stress tool.
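For example, something along these lines as a starting point (node address,
counts and thread numbers are only placeholders, not a recommendation):

cassandra-stress write n=1000000 -rate threads=50 -node 10.0.0.1
cassandra-stress mixed ratio\(write=1,read=3\) n=1000000 -rate threads=50 -node 10.0.0.1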

Sergio

On Fri, Nov 1, 2019, 12:31 PM Ben Mills  wrote:

> Greetings,
>
> We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a
> change to the GC config.
>
> What is the minimum amount of memory that needs to be allocated to heap
> space when using G1GC?
>
> For GC, we currently use CMS. Along with the version upgrade, we'll be
> running the stateful set of Cassandra pods on new machine types in a new
> node pool with 12Gi memory per node. Not a lot of memory but an
> improvement. We may be able to go up to 16Gi memory per node. We'd like to
> continue using these heap settings:
>
> -XX:+UnlockExperimentalVMOptions
> -XX:+UseCGroupMemoryLimitForHeap
> -XX:MaxRAMFraction=2
>
> which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of
> total available).
>
> Here are some details on the environment and configs in the event that
> something is relevant.
>
> Environment: Kubernetes
> Environment Config: Stateful set of 3 replicas
> Storage: Persistent Volumes
> Storage Class: SSD
> Node OS: Container-Optimized OS
> Container OS: Ubuntu 16.04.3 LTS
> Data Centers: 1
> Racks: 3 (one per zone)
> Nodes: 3
> Tokens: 4
> Replication Factor: 3
> Replication Strategy: NetworkTopologyStrategy (all keyspaces)
> Compaction Strategy: STCS (all tables)
> Read/Write Requirements: Blend of both
> Data Load: <1GB per node
> gc_grace_seconds: default (10 days - all tables)
>
> GC Settings: (CMS)
>
> -XX:+UseParNewGC
> -XX:+UseConcMarkSweepGC
> -XX:+CMSParallelRemarkEnabled
> -XX:SurvivorRatio=8
> -XX:MaxTenuringThreshold=1
> -XX:CMSInitiatingOccupancyFraction=75
> -XX:+UseCMSInitiatingOccupancyOnly
> -XX:CMSWaitDuration=3
> -XX:+CMSParallelInitialMarkEnabled
> -XX:+CMSEdenChunksRecordAlways
>
> Any ideas are much appreciated.
>


Re: Cassandra 4 alpha/alpha2

2019-10-31 Thread Sergio
OOO but still relevant:
Wouldn't it be possible to create an Amazon AMI that has all the OS and
JVM settings in the right place, so that from there each developer can tweak
the things that need to be adjusted?
Best,
Sergio

Il giorno gio 31 ott 2019 alle ore 12:56 Abdul Patel 
ha scritto:

> Looks like i am messing up or missing something ..will revisit again
>
> On Thursday, October 31, 2019, Stefan Miklosovic <
> stefan.mikloso...@instaclustr.com> wrote:
>
>> Hi,
>>
>> I have tested both alpha and alpha2 and 3.11.5 on Centos 7.7.1908 and
>> all went fine (I have some custom images for my own purposes).
>>
>> Update between alpha and alpha2 was just about mere version bump.
>>
>> Cheers
>>
>> On Thu, 31 Oct 2019 at 20:40, Abdul Patel  wrote:
>> >
>> > Hey Everyone
>> >
>> > Did anyone was successfull to install either alpha or alpha2 version
>> for cassandra 4.0?
>> > Found 2 issues :
>> > 1> cassandra-env.sh:
>> > JAVA_VERSION varianle is not defined.
>> > Jvm-server.options file is not defined.
>> >
>> > This is fixable and after adding those , the error for cassandra-env.sh
>> errora went away.
>> >
>> > 2> second and major issue the cassandea binary when i try to start says
>> syntax error.
>> >
>> > /bin/cassandea: line 198:exec: : not found.
>> >
>> > Anyone has any idea on second issue?
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>


Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Sergio
Hi Reid,

I don't have this loading problem anymore.
I solved it by changing the Cassandra driver configuration.
Now my cluster is pretty stable and I don't have machines with a crazy CPU
load.
The only thing, not urgent, that I still need to investigate is the number of
ESTABLISHED TCP connections. I see just one node with 7K TCP connections
ESTABLISHED while the others have around 4-6K connections open. So the newest
nodes added to the cluster have a higher number of ESTABLISHED TCP
connections.
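A quick way to get that per-node count, with 9042 being the native transport
port:

ss -tan | grep ':9042' | grep -c ESTAB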

default['cassandra']['sysctl'] = {
'net.ipv4.tcp_keepalive_time' => 60,
'net.ipv4.tcp_keepalive_probes' => 3,
'net.ipv4.tcp_keepalive_intvl' => 10,
'net.core.rmem_max' => 16777216,
'net.core.wmem_max' => 16777216,
'net.core.rmem_default' => 16777216,
'net.core.wmem_default' => 16777216,
'net.core.optmem_max' => 40960,
'net.ipv4.tcp_rmem' => '4096 87380 16777216',
'net.ipv4.tcp_wmem' => '4096 65536 16777216',
'net.ipv4.ip_local_port_range' => '1 65535',
'net.ipv4.tcp_window_scaling' => 1,
  'net.core.netdev_max_backlog' => 2500,
  'net.core.somaxconn' => 65000,
'vm.max_map_count' => 1048575,
'vm.swappiness' => 0
}

These are my tweaked values; I used the values recommended by DataStax.

Do you have something different?

Best,
Sergio

Il giorno mer 30 ott 2019 alle ore 13:27 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> Oh nvm, didn't see the later msg about just posting what your fix was.
>
> R
>
>
> On 10/30/19, 4:24 PM, "Reid Pinchback" 
> wrote:
>
>  Message from External Sender
>
> Hi Sergio,
>
> Assuming nobody is actually mounting a SYN flood attack, then this
> sounds like you're either being hammered with connection requests in very
> short periods of time, or your TCP backlog tuning is off.   At least,
> that's where I'd start looking.  If you take that log message and google it
> (Possible SYN flooding... Sending cookies") you'll find explanations.  Or
> just googling "TCP backlog tuning".
>
> R
>
>
> On 10/30/19, 3:29 PM, "Sergio Bilello" 
> wrote:
>
> >
> >Oct 17 00:23:03 prod-personalization-live-data-cassandra-08
> kernel: TCP: request_sock_TCP: Possible SYN flooding on port 9042. Sending
> cookies. Check SNMP counters.
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
>


Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Sergio Bilello
https://docs.datastax.com/en/drivers/java/2.2/com/datastax/driver/core/policies/LatencyAwarePolicy.html
I had to change the policy in the Cassandra driver. I solved this problem a
few weeks ago. I am just posting the solution for anyone who could hit the
same issue.
Best,
Sergio

On 2019/10/17 02:46:01, Sergio Bilello  wrote: 
> Hello guys!
> 
> I performed a thread dump 
> https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMTAvMTcvLS1kdW1wLnR4dC0tMC0zMC00MA==;
>  while try to join the node with
> 
> -Dcassandra.join_ring=false
> 
> OR
> -Dcassandra.join.ring=false
> 
> OR
> 
> -Djoin.ring=false
> 
> because the node spiked in load and latency was affecting the clients.
> 
> With or without that flag the node is high in latency and I see the load sky 
> rocketing when the number of TCP established connections increases
> 
> Analyzing the /var/log/messages I am able to read
> 
> Oct 17 00:23:39 prod-personalization-live-data-cassandra-08 cassandra: INFO 
> [Service Thread] 2019-10-17 00:23:39,030 GCInspector.java:284 - G1 Young 
> Generation GC in 255ms. G1 Eden Space: 361758720 -> 0; G1 Old Gen: 1855455944 
> -> 1781007048; G1 Survivor Space: 39845888 -> 32505856;
> 
> Oct 17 00:23:40 prod-personalization-live-data-cassandra-08 cassandra: INFO 
> [ScheduledTasks:1] 2019-10-17 00:23:40,352 NoSpamLogger.java:91 - Some 
> operations were slow, details available at debug level (debug.log)
> 
> 
> Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: TCP: 
> request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. Check 
> SNMP counters.
> 
> I don't see anything on debug.log that looks to be relevant
> 
> The machine is on aws with 4 cpu with 32GB Ram and 1 TB SSD i3.xlarge
> 
> 
> 
> 
> 
> [sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$ nodetool 
> tpstats
> 
> Pool Name Active Pending Completed Blocked All time blocked
> 
> ReadStage 32 53 559304 0 0
> 
> MiscStage 0 0 0 0 0
> 
> CompactionExecutor 1 107 118 0 0
> 
> MutationStage 0 0 2695 0 0
> 
> MemtableReclaimMemory 0 0 11 0 0
> 
> PendingRangeCalculator 0 0 33 0 0
> 
> GossipStage 0 0 4314 0 0
> 
> SecondaryIndexManagement 0 0 0 0 0
> 
> HintsDispatcher 0 0 0 0 0
> 
> RequestResponseStage 0 0 421865 0 0
> 
> Native-Transport-Requests 22 0 1903400 0 0
> 
> ReadRepairStage 0 0 59078 0 0
> 
> CounterMutationStage 0 0 0 0 0
> 
> MigrationStage 0 0 0 0 0
> 
> MemtablePostFlush 0 0 32 0 0
> 
> PerDiskMemtableFlushWriter_0 0 0 11 0 0
> 
> ValidationExecutor 0 0 0 0 0
> 
> Sampler 0 0 0 0 0
> 
> MemtableFlushWriter 0 0 11 0 0
> 
> InternalResponseStage 0 0 0 0 0
> 
> ViewMutationStage 0 0 0 0 0
> 
> AntiEntropyStage 0 0 0 0 0
> 
> CacheCleanupExecutor 0 0 0 0 0
> 
> 
> 
> Message type Dropped
> 
> READ 0
> 
> RANGE_SLICE 0
> 
> _TRACE 0
> 
> HINT 0
> 
> MUTATION 0
> 
> COUNTER_MUTATION 0
> 
> BATCH_STORE 0
> 
> BATCH_REMOVE 0
> 
> REQUEST_RESPONSE 0
> 
> PAGED_RANGE 0
> 
> READ_REPAIR 0
> 
> [sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$
> 
> 
> 
> 
> 
> top - 01:44:15 up 2 days, 1:45, 4 users, load average: 34.45, 27.71, 15.37
> 
> Tasks: 140 total, 1 running, 74 sleeping, 0 stopped, 0 zombie
> 
> %Cpu(s): 90.0 us, 4.5 sy, 3.0 ni, 1.1 id, 0.0 wa, 0.0 hi, 1.4 si, 0.0 st
> 
> KiB Mem : 31391772 total, 250504 free, 10880364 used, 20260904 buff/cache
> 
> KiB Swap: 0 total, 0 free, 0 used. 19341960 avail Mem
> 
> 
> 
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 
> 20712 cassand+ 20 0 194.1g 14.4g 4.6g S 392.0 48.2 74:50.48 java
> 
> 20823 sergio.+ 20 0 124856 6304 3136 S 1.7 0.0 0:13.51 htop
> 
> 7865 root 20 0 1062684 39880 11428 S 0.7 0.1 4:06.02 ir_agent
> 
> 3557 consul 20 0 41568 30192 18832 S 0.3 0.1 13:16.37 consul
> 
> 7600 root 20 0 2082700 46624 11880 S 0.3 0.1 4:14.60 ir_agent
> 
> 1 root 20 0 193660 7740 5220 S 0.0 0.0 0:56.36 systemd
> 
> 2 root 20 0 0 0 0 S 0.0 0.0 0:00.08 kthreadd
> 
> 4 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 kworker/0:0H
> 
> 6 root 0 -20 0 0 0 I 0.0 0.0 0:00.00 mm_percpu_wq
> 
> 7 root 20 0 0 0 0 S 0.0 0.0 0:06.04 ksoftirqd/0
> 
> 
> 
> 
> 
> [sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$ free
> 
> total used free shared buff/cache available
> 
> Mem: 31391772 10880916 256732 426552 20254124 19341768
> 
> Swap: 0 0 0
> 
> [sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$
> 
> 
> 
> 
> 
> 
> 
> bash-4.2$ java -jar sjk.jar ttop -p 20712
> 
> Monitoring threads ...
>

Re: Keyspace Clone in Existing Cluster

2019-10-29 Thread Sergio Bilello
Rolling bounce = rolling repair per node? Wouldn't it be easier to schedule
that with Cassandra Reaper?
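For reference, the snapshot-and-refresh flow described in the quoted reply
below looks roughly like this per node (keyspace, table and data path are
examples, and the target tables must already exist in the new keyspace):

# 1. snapshot the source keyspace
nodetool snapshot -t ks_clone ks_v1
# 2. copy the snapshot SSTables into the matching table directory of the new keyspace
cp /var/lib/cassandra/data/ks_v1/my_table-*/snapshots/ks_clone/* /var/lib/cassandra/data/ks_v2/my_table-*/
# 3. pick them up without a restart (the alternative to the rolling bounce)
nodetool refresh ks_v2 my_table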
On 2019/10/29 15:35:42, Paul Carlucci  wrote: 
> Copy the schema from your source keyspace to your new target keyspace,
> nodetool snapshot on your source keyspace, copy the SSTable files over, do
> a rolling bounce, repair, enjoy.  In my experience a rolling bounce is
> easier than a nodetool refresh.
> 
> It's either that or just copy it with Spark.
> 
> On Tue, Oct 29, 2019, 11:19 AM Ankit Gadhiya  wrote:
> 
> > Thanks Alex. So How do I copy SSTables from 1.0 to 2.0? (Same
> > SSTableLoader or any other approach?)
> > Also since I've multi-node cluster - I'll have to do this on every single
> > node - is there any tool or better way to execute this just from a single
> > node?
> >
> > *Thanks & Regards,*
> > *Ankit Gadhiya*
> >
> >
> >
> > On Tue, Oct 29, 2019 at 11:16 AM Alex Ott  wrote:
> >
> >> You can create all tables in new keyspace, copy SSTables from 1.0 to 2.0
> >> tables & use nodetool refresh on tables in KS 2.0 to say Cassandra about
> >> them.
> >>
> >> On Tue, Oct 29, 2019 at 4:10 PM Ankit Gadhiya 
> >> wrote:
> >>
> >>> Hello Folks,
> >>>
> >>> Greetings!.
> >>>
> >>> I've a requirement in my project to setup Blue-Green deployment for
> >>> Cassandra. E.x. Say My current active schema (application pointing to) is
> >>> Keyspace V1.0 and for my next release I want to setup Keysapce 2.0 (with
> >>> some structural changes) and all testing/validation would happen on it and
> >>> once successful , App would switch connection to keyspace 2.0 - This would
> >>> be generic release deployment for our project.
> >>>
> >>> One of the approach we thought of would be to Create keyspace 2.0 as
> >>> clone from Keyspace 1.0 including data using sstableloader but this would
> >>> be time consuming, also being a multi-node cluster (6+6 in each DC) - it
> >>> wouldn't be very feasible to do this manually on all the nodes for 
> >>> multiple
> >>> tables part of that keyspace. Was wondering if we have any other creative
> >>> way to suffice this requirement.
> >>>
> >>> Appreciate your time on this.
> >>>
> >>>
> >>> *Thanks & Regards,*
> >>> *Ankit Gadhiya*
> >>>
> >>>
> >>
> >> --
> >> With best wishes,Alex Ott
> >> http://alexott.net/
> >> Twitter: alexott_en (English), alexott (Russian)
> >>
> >
> 




Default values for dclocal_read_repair_chance = 0.1 and read_repair_chance = 0 Should I set both to 0?

2019-10-26 Thread Sergio
I have a column family in a keyspace with Replication Factor = 3.
The client reads it with LOCAL_QUORUM. Does this mean that every read
should trigger a read repair, or not?
Are these parameters only meaningful with LOCAL_ONE or ONE consistency, then?

I also have an application that converts some data into SSTable format and
prepares it to be streamed to the cluster with sstableloader.
This operation is done to UPDATE the column family mentioned above.
Can I skip repairing the column family since the clients are reading at
LOCAL_QUORUM?
If I use LOCAL_ONE, should I repair the table with Reaper, or can I skip the
repair as long as all the nodes are up and running?
Reading the very latest data is not a hard requirement, and I believe stale
reads would be really unlikely anyway, so even with LOCAL_ONE I think I could
avoid running repairs with Reaper.
I would like to achieve consistency while, if possible, avoiding an expensive
repair cycle with Reaper.

What do you think about it?
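
In case it matters, this is what I would run if the answer is simply to zero
both options (a sketch; my_ks.my_cf is a placeholder for the real table):

cqlsh -e "ALTER TABLE my_ks.my_cf
  WITH dclocal_read_repair_chance = 0.0
  AND read_repair_chance = 0.0;"

As far as I understand, this only disables the probabilistic background read
repair; the blocking read repair triggered by a digest mismatch on a
LOCAL_QUORUM read still happens.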

Reference:
https://stackoverflow.com/questions/33240674/reparing-inconsistency-when-read-repair-chance-0
https://www.slideshare.net/DataStax/real-world-tales-of-repair-alexander-dejanovski-the-last-pickle-cassandra-summit-2016
SLIDE 85. Not repair everything.

Thanks everyone!

Have a great weekend!


Re: Decommissioned Node UNREACHABLE in describecluster but LEFT in gossipinfo

2019-10-26 Thread Sergio Bilello
It disappeared from describecluster after 1 day. It is only in gossipinfo now 
and this looks to be ok :)
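
For the archives, this is what I had queued up in case it had not cleared on
its own (a sketch; the IP is the decommissioned node, and assassinate is a
last resort only):

nodetool gossipinfo | grep -A 5 10.1.20.49   # endpoint still known to gossip?
nodetool describecluster                     # still shown as UNREACHABLE?
# only if it never clears and the node is permanently gone:
nodetool assassinate 10.1.20.49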

On 2019/10/25 04:01:03, Sergio  wrote: 
> Hi guys,
> 
> Cassandra 3.11.4
> 
> nodetool gossipinfo
> /10.1.20.49
>   generation:1571694191
>   heartbeat:279800
>   STATUS:279798:LEFT,-1013739435631815991,1572225050446
>   LOAD:279791:3.4105213781E11
>   SCHEMA:12:5cad59d2-c3d0-3a12-ad10-7578d225b082
>   DC:8:live
>   RACK:10:us-east-1a
>   RELEASE_VERSION:4:3.11.4
>   INTERNAL_IP:6:10.1.20.49
>   RPC_ADDRESS:3:10.1.20.49
>   NET_VERSION:1:11
>   HOST_ID:2:be5a0193-56e7-4d42-8cc8-5d2141ab4872
>   RPC_READY:29:true
>   TOKENS:15:
> 
> The node is not shown in nodetool status
> 
> and it is displayed as UNREACHABLE in nodetool describecluster
> 
> I found this old conversation
> https://grokbase.com/t/cassandra/user/162gwp6pz6/decommissioned-nodes-shows-up-in-nodetool-describecluster-as-unreachable-in-2-1-12-version
> 
> Is there something that I should do to fix this?
> 
> Best,
> 
> Sergio
> 




Decommissioned Node UNREACHABLE in describecluster but LEFT in gossipinfo

2019-10-24 Thread Sergio
Hi guys,

Cassandra 3.11.4

nodetool gossipinfo
/10.1.20.49
  generation:1571694191
  heartbeat:279800
  STATUS:279798:LEFT,-1013739435631815991,1572225050446
  LOAD:279791:3.4105213781E11
  SCHEMA:12:5cad59d2-c3d0-3a12-ad10-7578d225b082
  DC:8:live
  RACK:10:us-east-1a
  RELEASE_VERSION:4:3.11.4
  INTERNAL_IP:6:10.1.20.49
  RPC_ADDRESS:3:10.1.20.49
  NET_VERSION:1:11
  HOST_ID:2:be5a0193-56e7-4d42-8cc8-5d2141ab4872
  RPC_READY:29:true
  TOKENS:15:

The node is not shown in nodetool status

and it is displayed as UNREACHABLE in nodetool describecluster

I found this old conversation
https://grokbase.com/t/cassandra/user/162gwp6pz6/decommissioned-nodes-shows-up-in-nodetool-describecluster-as-unreachable-in-2-1-12-version

Is there something that I should do to fix this?

Best,

Sergio


Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-24 Thread Sergio
Thanks Reid!

I agree with all the things that you said!

Best,
Sergio

Il giorno gio 24 ott 2019 alle ore 09:25 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> Two different AWS AZs are in two different physical locations.  Typically
> different cities.  Which means that you’re trying to manage the risk of an
> AZ going dark, so you use more than one AZ just in case.  The downside is
> that you will have some degree of network performance difference between
> AZs because of whatever WAN pipe AWS owns/leased to connect between them.
>
>
>
> Having a DC in one AZ is easy to reason about.  The AZ is there, or it is
> not.  If you have two DCs in your cluster, and you lose an AZ, it means you
> still have a functioning cluster with one DC and you still have quorum.
> Yay, even in an outage, you know you can still do business.  You would only
> have to route any traffic normally sent to the other DC to the remaining
> one, so as long as there is resource headroom planning in how you provision
> your hardware, you’re in a safe state.
>
>
>
> If you start splitting a DC across AZs without using racks to organize
> nodes on a per-AZ basis, off the top of my head I don’t know how you reason
> about your risks for losing quorum without pausing to really think through
> vnodes and token distribution and whatnot.  I’m not a fan of topologies I
> can’t reason about when paged at 3 in the morning and I’m half asleep.  I
> prefer simple until the workload motivates complex.
>
>
>
> R
>
>
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Thursday, October 24, 2019 at 12:06 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Cassandra Rack - Datacenter Load Balancing relations
>
>
>
> *Message from External Sender*
>
> Thanks Reid and Jon!
>
>
>
> Yes I will stick with one rack per DC for sure and I will look at the
> Vnodes problem later on.
>
>
>
>
>
> What's the difference in terms of reliability between
>
> A) spreading 2 Datacenters across 3 AZ
>
> B) having 2 Datacenters in 2 separate AZ
>
> ?
>
>
>
>
>
> Best,
>
>
>
> Sergio
>
>
>
> On Thu, Oct 24, 2019, 7:36 AM Reid Pinchback 
> wrote:
>
> Hey Sergio,
>
>
>
> Forgive but I’m at work and had to skim the info quickly.
>
>
>
> When in doubt, simplify.  So 1 rack per DC.  Distributed systems get
> rapidly harder to reason about the more complicated you make them.  There’s
> more than enough to learn about C* without jumping into the complexity too
> soon.
>
>
>
> To deal with the unbalancing issue, pay attention to Jon Haddad’s advice
> on vnode count and how to fairly distribute tokens with a small vnode
> count.  I’d rather point you to his information, as I haven’t dug into
> vnode counts and token distribution in detail; he’s got a lot more time in
> C* than I do.  I come at this more as a traditional RDBMS and Java guy who
> has slowly gotten up to speed on C* over the last few years, and dealt with
> DynamoDB a lot so have lived with a lot of similarity in data modelling
> concerns.  Detailed internals I only know in cases where I had reason to
> dig into C* source.
>
>
>
> There are so many knobs to turn in C* that it can be very easy to
> overthink things.  Simplify where you can.  Remove GC pressure wherever you
> can.  Negotiate with your consumers to have data models that make sense for
> C*.  If you have those three criteria foremost in mind, you’ll likely be
> fine for quite some time.  And in the times where something isn’t going
> well, simpler is easier to investigate.
>
>
> R
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Wednesday, October 23, 2019 at 3:34 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Cassandra Rack - Datacenter Load Balancing relations
>
>
>
> *Message from External Sender*
>
> Hi Reid,
>
> Thank you very much for clearing these concepts for me.
> https://community.datastax.com/comments/1133/view.html
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__community.datastax.com_comments_1133_view.html=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=hcKr__B8MyXvYx8vQx20B_KN89ZynwB-N4px87tcYY8=RSwuSea6HjOb3gChVS_i4GnKgl--H0q-VHz38_setfc=>
> I posted this question on the datastax forum regarding our cluster that it
> is unbalanced and the reply was related that the *number of racks should
> be a multiplier of the replication factor *in order to be balanced or 1.
> I thought then if I have 3 availability zones I should have 3 racks for
> each datacent

Re: Repair Issues

2019-10-24 Thread Sergio
Are you using Cassandra reaper?
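
If you want to retry the failing keyspaces by hand in the meantime, a minimal
sketch (keyspace names taken from your output, default log path assumed; run
it on one node at a time):

nodetool repair -full -pr platform_users
nodetool repair -full -pr platform_management
# and check what the validation phase is complaining about:
grep -iE 'repair|validation' /var/log/cassandra/system.log | tail -n 50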

On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:

> Greetings,
>
> Inherited a small Cassandra cluster with some repair issues and need some
> advice on recommended next steps. Apologies in advance for a long email.
>
> Issue:
>
> Intermittent repair failures on two non-system keyspaces.
>
> - platform_users
> - platform_management
>
> Repair Type:
>
> Full, parallel repairs are run on each of the three nodes every five days.
>
> Repair command output for a typical failure:
>
> [2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace
> platform_users with repair options (parallelism: parallel, primary range:
> false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters:
> [], hosts: [], # of ranges: 12)
> [2019-10-18 00:22:09,242] Repair session
> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]] failed with error [repair
> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
> [(-1890954128429545684,2847510199483651721],
> (8249813014782655320,-8746483007209345011],
> (4299912178579297893,6811748355903297393],
> (-8746483007209345011,-8628999431140554276],
> (-5865769407232506956,-4746990901966533744],
> (-4470950459111056725,-1890954128429545684],
> (4001531392883953257,4299912178579297893],
> (6811748355903297393,6878104809564599690],
> (6878104809564599690,8249813014782655320],
> (-4746990901966533744,-4470950459111056725],
> (-8628999431140554276,-5865769407232506956],
> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
> (progress: 26%)
> [2019-10-18 00:22:09,246] Some repair failed
> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>
> Additional Notes:
>
> Repairs encounter above failures more often than not. Sometimes on one
> node only, though occasionally on two. Sometimes just one of the two
> keyspaces, sometimes both. Apparently the previous repair schedule for
> this cluster included incremental repairs (script alternated between
> incremental and full repairs). After reading this TLP article:
>
>
> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>
> the repair script was replaced with cassandra-reaper (v1.4.0), which was
> run with its default configs. Reaper was fine but only obscured the ongoing
> issues (it did not resolve them) and complicated the debugging process and
> so was then removed. The current repair schedule is as described above
> under Repair Type.
>
> Attempts at Resolution:
>
> (1) nodetool scrub was attempted on the offending keyspaces/tables to no
> effect.
>
> (2) sstablescrub has not been attempted due to the current design of the
> Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
> way to stop the server to run this utility without killing the only pid
> running in the container.
>
> Related Error:
>
> Not sure if this is related, though sometimes, when either:
>
> (a) Running nodetool snapshot, or
> (b) Rolling a pod that runs a Cassandra node, which calls nodetool drain
> prior shutdown,
>
> the following error is thrown:
>
> -- StackTrace --
> java.lang.RuntimeException: Last written key
> DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda,
> 10df3ba16eb24c8ebdddc0c7af586bda) >= current key
> DecoratedKey(----,
> 17343121887f480c9ba87c0e32206b74) writing into
> /cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
> at
> org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
> at
> org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
> at
> org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
> at
> org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
> at
> org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
> at
> org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:363)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-24 Thread Sergio
Thanks Reid and Jon!

Yes I will stick with one rack per DC for sure and I will look at the
Vnodes problem later on.


What's the difference in terms of reliability between
A) spreading 2 Datacenters across 3 AZ
B) having 2 Datacenters in 2 separate AZ
?


Best,

Sergio
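
P.S. For anyone following along, this is roughly how I am mapping nodes to
DCs and racks on each host (GossipingPropertyFileSnitch assumed; the DC and
rack names are examples, not what is deployed today):

# /etc/cassandra/conf/cassandra-rackdc.properties on a node of the read DC
dc=read
rack=rack1

# and on a node of the write DC
dc=write
rack=rack1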

On Thu, Oct 24, 2019, 7:36 AM Reid Pinchback 
wrote:

> Hey Sergio,
>
>
>
> Forgive but I’m at work and had to skim the info quickly.
>
>
>
> When in doubt, simplify.  So 1 rack per DC.  Distributed systems get
> rapidly harder to reason about the more complicated you make them.  There’s
> more than enough to learn about C* without jumping into the complexity too
> soon.
>
>
>
> To deal with the unbalancing issue, pay attention to Jon Haddad’s advice
> on vnode count and how to fairly distribute tokens with a small vnode
> count.  I’d rather point you to his information, as I haven’t dug into
> vnode counts and token distribution in detail; he’s got a lot more time in
> C* than I do.  I come at this more as a traditional RDBMS and Java guy who
> has slowly gotten up to speed on C* over the last few years, and dealt with
> DynamoDB a lot so have lived with a lot of similarity in data modelling
> concerns.  Detailed internals I only know in cases where I had reason to
> dig into C* source.
>
>
>
> There are so many knobs to turn in C* that it can be very easy to
> overthink things.  Simplify where you can.  Remove GC pressure wherever you
> can.  Negotiate with your consumers to have data models that make sense for
> C*.  If you have those three criteria foremost in mind, you’ll likely be
> fine for quite some time.  And in the times where something isn’t going
> well, simpler is easier to investigate.
>
>
> R
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Wednesday, October 23, 2019 at 3:34 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: Cassandra Rack - Datacenter Load Balancing relations
>
>
>
> *Message from External Sender*
>
> Hi Reid,
>
> Thank you very much for clearing these concepts for me.
> https://community.datastax.com/comments/1133/view.html
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__community.datastax.com_comments_1133_view.html=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=hcKr__B8MyXvYx8vQx20B_KN89ZynwB-N4px87tcYY8=RSwuSea6HjOb3gChVS_i4GnKgl--H0q-VHz38_setfc=>
> I posted this question on the datastax forum regarding our cluster that it
> is unbalanced and the reply was related that the *number of racks should
> be a multiplier of the replication factor *in order to be balanced or 1.
> I thought then if I have 3 availability zones I should have 3 racks for
> each datacenter and not 2 (us-east-1b, us-east-1a) as I have right now or
> in the easiest way, I should have a rack for each datacenter.
>
>
>
> 1.  Datacenter: live
> 
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address  Load   Tokens   OwnsHost ID
> Rack
> UN  10.1.20.49   289.75 GiB  256  ?
> be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
> UN  10.1.30.112  103.03 GiB  256  ?
> e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
> UN  10.1.19.163  129.61 GiB  256  ?
> 3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
> UN  10.1.26.181  145.28 GiB  256  ?
> 0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
> UN  10.1.17.213  149.04 GiB  256  ?
> 71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
> DN  10.1.19.198  52.41 GiB  256  ?
> 613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
> UN  10.1.31.60   195.17 GiB  256  ?
> 3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
> UN  10.1.25.206  100.67 GiB  256  ?
> f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b
> So each rack label right now matches the availability zone and we have 3
> Datacenters and 2 Availability Zone with 2 racks per DC but the above is
> clearly unbalanced
> If I have a keyspace with a replication factor = 3 and I want to minimize
> the number of nodes to scale up and down the cluster and keep it balanced
> should I consider an approach like OPTION A)
>
> 2.  Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>
> 3.  3 read ONE us-east-1a
>
> 4.  4 write ONE us-east-1b 5 write ONE us-east-1b
>
> 5.  6 write ONE us-east-1b
>
> 6.  OPTION B)
>
> 7.  Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>
> 8.  3 read ONE us-east-1a
>
> 9.  4 write TWO us-east-1b 5 write TWO us-east-1b
>
> 10.6 write TWO us-east-1b
>
> 11.*7 read ONE us-east-1c 8 write TWO us-east-1c*
>
> 12.*9 read ONE us-east-1c* Optio

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Sergio
Thanks, Jon!

I just added the AZ for each rack on the right column.
However thanks for your reply and clarification.
Maybe I should have marked the rack names with RACK-READ and RACK-WRITE to
avoid confusion and not use ONE and TWO.

Which is more fault-tolerant with RF = 3:

A) spread each DC across 3 AZ
B) assign to each DC a separate AZ

I assume that I should adjust the consistency level accordingly in case of
failures:
If I have 3 nodes and 1 goes down with RF = 3 and LOCAL_QUORUM consistency
I should downgrade to LOCAL_ONE if I want to keep serving traffic for reads.

Best,

Sergio
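
P.S. For quick manual checks of that failure scenario I use a cqlsh session,
something like this (my_ks.my_cf is a placeholder table):

$ cqlsh
cqlsh> CONSISTENCY LOCAL_QUORUM;
cqlsh> SELECT * FROM my_ks.my_cf LIMIT 1;
cqlsh> CONSISTENCY LOCAL_ONE;
cqlsh> SELECT * FROM my_ks.my_cf LIMIT 1;

The second pair is the fallback I would switch to only if LOCAL_QUORUM can no
longer be met in the local DC.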





Il giorno mer 23 ott 2019 alle ore 14:12 Jon Haddad  ha
scritto:

> Oh, my bad.  There was a flood of information there, I didn't realize you
> had switched to two DCs.  It's been a long day.
>
> I'll be honest, it's really hard to read your various options as you've
> intermixed terminology from AWS and Cassandra in a weird way and there's
> several pages of information here to go through.  I don't have time to
> decipher it, sorry.
>
> Spread a DC across 3 AZs if you want to be fault tolerant and will use
> RF=3, use a single AZ if you don't care about full DC failure in the case
> of an AZ failure or you're not using RF=3.
>
>
> On Wed, Oct 23, 2019 at 4:56 PM Sergio  wrote:
>
>> OPTION C or OPTION A?
>>
>> Which one are you referring to?
>>
>> Both have separate DCs to keep the workload separate.
>>
>>- OPTION A)
>>- Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>- 3 read ONE us-east-1a
>>- 4 write TWO us-east-1b 5 write TWO us-east-1b
>>- 6 write TWO us-east-1b
>>
>>
>> Here we have 2 DC read and write
>> One Rack per DC
>> One Availability Zone per DC
>>
>> Thanks,
>>
>> Sergio
>>
>>
>> On Wed, Oct 23, 2019, 1:11 PM Jon Haddad  wrote:
>>
>>> Personally, I wouldn't ever do this.  I recommend separate DCs if you
>>> want to keep workloads separate.
>>>
>>> On Wed, Oct 23, 2019 at 4:06 PM Sergio 
>>> wrote:
>>>
>>>>   I forgot to comment for
>>>>
>>>>OPTION C)
>>>>1. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>>>>2. 3 read ONE us-east-1c
>>>>3. 4 write TWO us-east-1a 5 write TWO us-east-1b
>>>>4. 6 write TWO us-east-1c I would expect that I need to decrease
>>>>the Consistency Level in the reads if one of the AZ goes down. Please
>>>>consider the below one as the real OPTION A. The previous one looks to 
>>>> be
>>>>wrong because the same rack is assigned to 2 different DC.
>>>>5. OPTION A)
>>>>6. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>>>7. 3 read ONE us-east-1a
>>>>8. 4 write TWO us-east-1b 5 write TWO us-east-1b
>>>>9. 6 write TWO us-east-1b
>>>>
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Sergio
>>>>
>>>> Il giorno mer 23 ott 2019 alle ore 12:33 Sergio <
>>>> lapostadiser...@gmail.com> ha scritto:
>>>>
>>>>> Hi Reid,
>>>>>
>>>>> Thank you very much for clearing these concepts for me.
>>>>> https://community.datastax.com/comments/1133/view.html I posted this
>>>>> question on the datastax forum regarding our cluster that it is unbalanced
>>>>> and the reply was related that the *number of racks should be a
>>>>> multiplier of the replication factor *in order to be balanced or 1. I
>>>>> thought then if I have 3 availability zones I should have 3 racks for each
>>>>> datacenter and not 2 (us-east-1b, us-east-1a) as I have right now or in 
>>>>> the
>>>>> easiest way, I should have a rack for each datacenter.
>>>>>
>>>>>
>>>>>
>>>>>1. Datacenter: live
>>>>>
>>>>>Status=Up/Down
>>>>>|/ State=Normal/Leaving/Joining/Moving
>>>>>--  Address  Load   Tokens   OwnsHost ID
>>>>>Rack
>>>>>UN  10.1.20.49   289.75 GiB  256  ?
>>>>>be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
>>>>>UN  10.1.30.112  103.03 GiB  256  ?
>>>>>e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
>>>>>UN  10.1.19.163  129.61 GiB  256  ?
>>>>>3c2efdda-8dd4-4f08-b991-9

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Sergio
OPTION C or OPTION A?

Which one are you referring to?

Both have separate DCs to keep the workload separate.

   OPTION A)
   Node  DC     RACK  AZ
   1     read   ONE   us-east-1a
   2     read   ONE   us-east-1a
   3     read   ONE   us-east-1a
   4     write  TWO   us-east-1b
   5     write  TWO   us-east-1b
   6     write  TWO   us-east-1b


Here we have 2 DC read and write
One Rack per DC
One Availability Zone per DC

Thanks,

Sergio


On Wed, Oct 23, 2019, 1:11 PM Jon Haddad  wrote:

> Personally, I wouldn't ever do this.  I recommend separate DCs if you want
> to keep workloads separate.
>
> On Wed, Oct 23, 2019 at 4:06 PM Sergio  wrote:
>
>>   I forgot to comment for
>>
>>OPTION C)
>>1. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>>2. 3 read ONE us-east-1c
>>3. 4 write TWO us-east-1a 5 write TWO us-east-1b
>>4. 6 write TWO us-east-1c I would expect that I need to decrease the
>>Consistency Level in the reads if one of the AZ goes down. Please consider
>>the below one as the real OPTION A. The previous one looks to be wrong
>>because the same rack is assigned to 2 different DC.
>>5. OPTION A)
>>6. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>7. 3 read ONE us-east-1a
>>8. 4 write TWO us-east-1b 5 write TWO us-east-1b
>>9. 6 write TWO us-east-1b
>>
>>
>>
>> Thanks,
>>
>> Sergio
>>
>> Il giorno mer 23 ott 2019 alle ore 12:33 Sergio <
>> lapostadiser...@gmail.com> ha scritto:
>>
>>> Hi Reid,
>>>
>>> Thank you very much for clearing these concepts for me.
>>> https://community.datastax.com/comments/1133/view.html I posted this
>>> question on the datastax forum regarding our cluster that it is unbalanced
>>> and the reply was related that the *number of racks should be a
>>> multiplier of the replication factor *in order to be balanced or 1. I
>>> thought then if I have 3 availability zones I should have 3 racks for each
>>> datacenter and not 2 (us-east-1b, us-east-1a) as I have right now or in the
>>> easiest way, I should have a rack for each datacenter.
>>>
>>>
>>>
>>>1. Datacenter: live
>>>
>>>Status=Up/Down
>>>|/ State=Normal/Leaving/Joining/Moving
>>>--  Address  Load   Tokens   OwnsHost ID
>>>  Rack
>>>UN  10.1.20.49   289.75 GiB  256  ?
>>>be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
>>>UN  10.1.30.112  103.03 GiB  256  ?
>>>e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
>>>UN  10.1.19.163  129.61 GiB  256  ?
>>>3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
>>>UN  10.1.26.181  145.28 GiB  256  ?
>>>0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
>>>UN  10.1.17.213  149.04 GiB  256  ?
>>>71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
>>>DN  10.1.19.198  52.41 GiB  256  ?
>>>613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
>>>UN  10.1.31.60   195.17 GiB  256  ?
>>>3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
>>>UN  10.1.25.206  100.67 GiB  256  ?
>>>f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b
>>>So each rack label right now matches the availability zone and we
>>>have 3 Datacenters and 2 Availability Zone with 2 racks per DC but the
>>>above is clearly unbalanced
>>>If I have a keyspace with a replication factor = 3 and I want to
>>>minimize the number of nodes to scale up and down the cluster and keep it
>>>balanced should I consider an approach like OPTION A)
>>>2. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>>3. 3 read ONE us-east-1a
>>>4. 4 write ONE us-east-1b 5 write ONE us-east-1b
>>>5. 6 write ONE us-east-1b
>>>6. OPTION B)
>>>7. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>>8. 3 read ONE us-east-1a
>>>9. 4 write TWO us-east-1b 5 write TWO us-east-1b
>>>10. 6 write TWO us-east-1b
>>>11. *7 read ONE us-east-1c 8 write TWO us-east-1c*
>>>12. *9 read ONE us-east-1c* Option B looks to be unbalanced and I
>>>would exclude it OPTION C)
>>>13. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>>>14. 3 read ONE us-east-1c
>>>15. 4 write TWO us-east-1a 5 write TWO us-east-1b
>>>16. 6 write TWO us-east-1c
>>>17.

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Sergio
  I forgot to comment on OPTION C):

   OPTION C)
   Node  DC     RACK  AZ
   1     read   ONE   us-east-1a
   2     read   ONE   us-east-1b
   3     read   ONE   us-east-1c
   4     write  TWO   us-east-1a
   5     write  TWO   us-east-1b
   6     write  TWO   us-east-1c

   I would expect that I need to decrease the consistency level for reads if
   one of the AZs goes down. Please consider the one below as the real OPTION
   A; the previous one looks wrong because the same rack was assigned to 2
   different DCs.

   OPTION A)
   Node  DC     RACK  AZ
   1     read   ONE   us-east-1a
   2     read   ONE   us-east-1a
   3     read   ONE   us-east-1a
   4     write  TWO   us-east-1b
   5     write  TWO   us-east-1b
   6     write  TWO   us-east-1b



Thanks,

Sergio

Il giorno mer 23 ott 2019 alle ore 12:33 Sergio 
ha scritto:

> Hi Reid,
>
> Thank you very much for clearing these concepts for me.
> https://community.datastax.com/comments/1133/view.html I posted this
> question on the datastax forum regarding our cluster that it is unbalanced
> and the reply was related that the *number of racks should be a
> multiplier of the replication factor *in order to be balanced or 1. I
> thought then if I have 3 availability zones I should have 3 racks for each
> datacenter and not 2 (us-east-1b, us-east-1a) as I have right now or in the
> easiest way, I should have a rack for each datacenter.
>
>
>
>1. Datacenter: live
>
>Status=Up/Down
>|/ State=Normal/Leaving/Joining/Moving
>--  Address  Load   Tokens   OwnsHost ID
>Rack
>UN  10.1.20.49   289.75 GiB  256  ?
>be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
>UN  10.1.30.112  103.03 GiB  256  ?
>e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
>UN  10.1.19.163  129.61 GiB  256  ?
>3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
>UN  10.1.26.181  145.28 GiB  256  ?
>0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
>UN  10.1.17.213  149.04 GiB  256  ?
>71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
>DN  10.1.19.198  52.41 GiB  256  ?
>613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
>UN  10.1.31.60   195.17 GiB  256  ?
>3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
>UN  10.1.25.206  100.67 GiB  256  ?
>f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b
>So each rack label right now matches the availability zone and we have
>3 Datacenters and 2 Availability Zone with 2 racks per DC but the above is
>clearly unbalanced
>If I have a keyspace with a replication factor = 3 and I want to
>minimize the number of nodes to scale up and down the cluster and keep it
>balanced should I consider an approach like OPTION A)
>2. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>3. 3 read ONE us-east-1a
>4. 4 write ONE us-east-1b 5 write ONE us-east-1b
>5. 6 write ONE us-east-1b
>6. OPTION B)
>7. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>8. 3 read ONE us-east-1a
>9. 4 write TWO us-east-1b 5 write TWO us-east-1b
>10. 6 write TWO us-east-1b
>11. *7 read ONE us-east-1c 8 write TWO us-east-1c*
>12. *9 read ONE us-east-1c* Option B looks to be unbalanced and I
>would exclude it OPTION C)
>13. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>14. 3 read ONE us-east-1c
>15. 4 write TWO us-east-1a 5 write TWO us-east-1b
>16. 6 write TWO us-east-1c
>17.
>
>
>so I am thinking of A if I have the restriction of 2 AZ but I guess
>that option C would be the best. If I have to add another DC for reads
>because we want to assign a new DC for each new microservice it would look
>like:
>   OPTION EXTRA DC For Reads
>   1. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>   2. 3 read ONE us-east-1c
>   3. 4 write TWO us-east-1a 5 write TWO us-east-1b
>   4. 6 write TWO us-east-1c 7 extra-read THREE us-east-1a
>   5. 8 extra-read THREE us-east-1b
>   6.
>  7.
>
>
>1. 9 extra-read THREE us-east-1c
>   2.
>The DC for *write* will replicate the data in the other datacenters.
>My scope is to keep the *read* machines dedicated to serve reads and
>*write* machines to serve writes. Cassandra will handle the
>replication for me. Is there any other option that is I missing or wrong
>assumption? I am thinking that I will write a blog post about all my
>learnings so far, thank you very much for the replies Best, Sergio
>
>
> Il giorno mer 23 ott 2019 alle ore 10:57 Reid Pinchback <
> rpinchb...@tripadvisor.com> ha scritto:
>
>> No, that’s not correct.  The point of racks is to help you distribute the
>

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Sergio
Hi Reid,

Thank you very much for clearing these concepts up for me.
I posted a question about our unbalanced cluster on the DataStax forum
(https://community.datastax.com/comments/1133/view.html), and the reply was
that the number of racks should be a multiple of the replication factor (or
just 1) for the cluster to be balanced. My thought was then that with 3
availability zones I should have 3 racks for each datacenter, not the 2
(us-east-1b, us-east-1a) I have right now, or, in the simplest setup, a
single rack per datacenter.

Datacenter: live
================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load        Tokens  Owns  Host ID                               Rack
UN  10.1.20.49   289.75 GiB  256     ?     be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
UN  10.1.30.112  103.03 GiB  256     ?     e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.19.163  129.61 GiB  256     ?     3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
UN  10.1.26.181  145.28 GiB  256     ?     0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
UN  10.1.17.213  149.04 GiB  256     ?     71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
DN  10.1.19.198  52.41 GiB   256     ?     613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
UN  10.1.31.60   195.17 GiB  256     ?     3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
UN  10.1.25.206  100.67 GiB  256     ?     f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b

So each rack label currently matches the availability zone, and we have 3
datacenters and 2 availability zones with 2 racks per DC, but the cluster
above is clearly unbalanced.
If I have a keyspace with a replication factor = 3 and I want to minimize
the number of nodes needed to scale the cluster up and down while keeping it
balanced, should I consider an approach like one of the following?

OPTION A)
Node  DC     RACK  AZ
1     read   ONE   us-east-1a
2     read   ONE   us-east-1a
3     read   ONE   us-east-1a
4     write  ONE   us-east-1b
5     write  ONE   us-east-1b
6     write  ONE   us-east-1b

OPTION B)
Node  DC     RACK  AZ
1     read   ONE   us-east-1a
2     read   ONE   us-east-1a
3     read   ONE   us-east-1a
4     write  TWO   us-east-1b
5     write  TWO   us-east-1b
6     write  TWO   us-east-1b
7     read   ONE   us-east-1c
8     write  TWO   us-east-1c
9     read   ONE   us-east-1c

Option B looks unbalanced, so I would exclude it.

OPTION C)
Node  DC     RACK  AZ
1     read   ONE   us-east-1a
2     read   ONE   us-east-1b
3     read   ONE   us-east-1c
4     write  TWO   us-east-1a
5     write  TWO   us-east-1b
6     write  TWO   us-east-1c

So I am leaning towards A if I am restricted to 2 AZs, but I guess option C
would be the best. If I had to add another DC for reads, because we want to
assign a new DC to each new microservice, it would look like:

OPTION EXTRA DC FOR READS)
Node  DC          RACK   AZ
1     read        ONE    us-east-1a
2     read        ONE    us-east-1b
3     read        ONE    us-east-1c
4     write       TWO    us-east-1a
5     write       TWO    us-east-1b
6     write       TWO    us-east-1c
7     extra-read  THREE  us-east-1a
8     extra-read  THREE  us-east-1b
9     extra-read  THREE  us-east-1c

The write DC will replicate the data to the other datacenters. My goal is to
keep the read machines dedicated to serving reads and the write machines to
serving writes; Cassandra will handle the replication for me. Is there any
other option I am missing, or any wrong assumption? I am thinking of writing
a blog post about all my learnings so far. Thank you very much for the
replies.

Best,
Sergio


Il giorno mer 23 ott 2019 alle ore 10:57 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> No, that’s not correct.  The point of racks is to help you distribute the
> replicas, not further-replicate the replicas.  Data centers are what do the
> latter.  So for example, if you wanted to be able to ensure that you always
> had quorum if an AZ went down, then you could have two DCs where one was in
> each AZ, and use one rack in each DC.  In your situation I think I’d be
> more tempted to consider that.  Then if an AZ went away, you could fail
> over your traffic to the remaining DC and still be perfectly fine.
>
>
>
> For background on replicas vs racks, I believe the information you want is
> under the heading ‘NetworkTopologyStrategy’ at:
>
> http://cassandra.apache.org/doc/latest/architecture/dynamo.html
>
>
>
> That should help you better understand how replicas distribute.
>
>
>
> As mentioned before, while you can choose to do the reads in one DC,
> except for concerns about contention related to network traffic and
> connection handling, you can’t isolate reads from writes.  You can _
> *mostly*_ insulate the write DC from the activity within the read DC, and
> even that isn’t an absolute because of repairs.  However, your mileage may
> vary, so

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Sergio
Hi Reid,

Thanks for your reply. I really appreciate your explanation.

We are in AWS and right now we are using 2 Availability Zones, not 3. We
found our cluster really unbalanced because the keyspace has a replication
factor = 3 while there are 2 racks and 2 datacenters.
We want the writes spread across all the nodes, but we wanted the reads
isolated from the writes to keep the load on those nodes low and to be able
to tell apart problems in the consumer (read) and producer (write)
applications.
It looks like each rack contains an entire copy of the data, so the data
would be replicated once per rack and then again per node. If I am correct,
a keyspace with 100GB, Replication Factor = 3 and RACKS = 3 would give
100 * 3 * 3 = 900GB.
If I had only one rack across 2 or even 3 availability zones I would save
space and have only 300GB. Please correct me if I am wrong.

Best,

Sergio



Il giorno mer 23 ott 2019 alle ore 09:21 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> Datacenters and racks are different concepts.  While they don't have to be
> associated with their historical meanings, the historical meanings probably
> provide a helpful model for understanding what you want from them.
>
> When companies own their own physical servers and have them housed
> somewhere, the questions arise on where you want to locate any particular
> server.  It's a balancing act on things like network speed of related
> servers being able to talk to each other, versus fault-tolerance of having
> many servers not all exposed to the same risks.
>
> "Same rack" in that physical world tended to mean something like "all
> behind the same network switch and all sharing the same power bus".  The
> morning after an electrical glitch fries a power bus and thus everything in
> that rack, you realize you wished you didn't have so many of the same type
> of server together.  Well, they were servers.  Now they are door stops.
> Badness and sadness.
>
> That's kind of the mindset to have in mind with racks in Cassandra.  It's
> an artifact for you to separate servers into pools so that the disparate
> pools have hopefully somewhat independent infrastructure risks.  However,
> all those servers are still doing the same kind of work, are the same
> version, etc.
>
> Datacenters are amalgams of those racks, and how similar or different they
> are from each other depends on what you want to do with them.  What is true
> is that if you have N datacenters, each one of them must have enough disk
> storage to house all the data.  The actual physical footprint of that data
> in each DC depends on the replication factors in play.
>
> Note that you sorta can't have "one datacenter for writes" because the
> writes will replicate across the data centers.  You could definitely choose
> to have only one that takes read queries, but best to think of writing as
> being universal.  One scenario you can have is where the DC not taking live
> traffic read queries is the one you use for maintenance or performance
> testing or version upgrades.
>
> One rack makes your life easier if you don't have a reason for multiple
> racks. It depends on the environment you deploy into and your fault
> tolerance goals.  If you were in AWS and wanting to spread risk across
> availability zones, then you would likely have as many racks as AZs you
> choose to be in, because that's really the point of using multiple AZs.
>
> R
>
>
> On 10/23/19, 4:06 AM, "Sergio Bilello"  wrote:
>
>  Message from External Sender
>
> Hello guys!
>
> I was reading about
> https://urldefense.proofpoint.com/v2/url?u=https-3A__cassandra.apache.org_doc_latest_architecture_dynamo.html-23networktopologystrategy=DwIBaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=xmgs1uQTlmvCtIoGJKHbByZZ6aDFzS5hDQzChDPCfFA=9ZDWAK6pstkCQfdbwLNsB-ZGsK64RwXSXfAkOWtmkq4=
>
> I would like to understand a concept related to the node load
> balancing.
>
> I know that Jon recommends Vnodes = 4 but right now I found a cluster
> with vnodes = 256 replication factor = 3 and 2 racks. This is unbalanced
> because the racks are not a multiplier of the replication factor.
>
> However, my plan is to move all the nodes in a single rack to
> eventually scale up and down the node in the cluster once at the time.
>
> If I had 3 racks and I would like to keep the things balanced I should
> scale up 3 nodes at the time one for each rack.
>
> If I would have 3 racks, should I have also 3 different datacenters so
> one datacenter for each rack?
>
> Can I have 2 datacenters and 3 racks? If this is possible one
> datacenter would have mor

Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Sergio Bilello
Hello guys!
I was reading about 
https://cassandra.apache.org/doc/latest/architecture/dynamo.html#networktopologystrategy
I would like to understand a concept related to the node load balancing.
I know that Jon recommends vnodes = 4, but right now I found a cluster with 
vnodes = 256, replication factor = 3 and 2 racks. This is unbalanced because the 
number of racks is not a multiple of the replication factor.
My plan, however, is to move all the nodes into a single rack so that I can 
eventually scale the cluster up and down one node at a time.
If I had 3 racks and wanted to keep things balanced, I would have to scale up 
3 nodes at a time, one per rack.
If I had 3 racks, should I also have 3 different datacenters, one datacenter 
per rack?
Can I have 2 datacenters and 3 racks? If so, would one datacenter end up with 
more nodes than the other, and could that be a problem?
I am thinking of splitting my cluster into one datacenter for reads and one for 
writes, and keeping all the nodes in the same rack so I can scale up one node 
at a time.

Please correct me if I am wrong
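
To make the read/write split concrete, the kind of keyspace definition I have
in mind is something like this (just a sketch: keyspace name, DC names and
replication factors are placeholders, nothing is deployed like this yet):

cqlsh -e "CREATE KEYSPACE my_ks WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'read': 3,
  'write': 3
};"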

Thanks,

Sergio




Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Sergio
Thanks Jon!

I used that tool and ran a test comparing LCS and STCS, and it works
great. However, I was referring to the JVM flags that you use, since there
are a lot of flags in our current configuration that came from defaults and
I would like to drop the unused or wrong ones.

I also have another thread open where I am trying to figure out the kernel
settings for TCP:
https://lists.apache.org/thread.html/7708c22a1d95882598cbcc29bc34fa54c01fcb33c40bb616dcd3956d@%3Cuser.cassandra.apache.org%3E

Do you have anything to add to that?

Thanks,

Sergio
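
P.S. For reference, the side-by-side compaction test I mentioned was
essentially the following (flags as documented in your manual; the duration
is just what I picked, and I am assuming --compaction also accepts the stcs
shorthand):

tlp-stress run KeyValue -d 24h --compaction stcs -p 10m -r .9
tlp-stress run KeyValue -d 24h --compaction lcs -p 10m -r .9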

Il giorno lun 21 ott 2019 alle ore 15:09 Jon Haddad  ha
scritto:

> tlp-stress comes with workloads pre-baked, so there's not much
> configuration to do.  The main flags you'll want are going to be:
>
> -d : duration, I highly recommend running your test for a few days
> --compaction
> --compression
> -p: number of partitions
> -r: % of reads, 0-1
>
> For example, you might run:
>
> tlp-stress run KeyValue -d 24h --compaction lcs -p 10m -r .9
>
> for a basic key value table, running for 24 hours, using LCS, 10 million
> partitions, 90% reads.
>
> There's a lot of options. I won't list them all here, it's why I wrote the
> manual :)
>
> Jon
>
>
> On Mon, Oct 21, 2019 at 1:16 PM Sergio  wrote:
>
>> Thanks, guys!
>> I just copied and paste what I found on our test machines but I can
>> confirm that we have the same settings except for 8GB in production.
>> I didn't select these settings and I need to verify why these settings
>> are there.
>> If any of you want to share your flags for a read-heavy workload it would
>> be appreciated, so I would replace and test those flags with TLP-STRESS.
>> I am thinking about different approaches (G1GC vs ParNew + CMS)
>> How many GB for RAM do you dedicate to the OS in percentage or in an
>> exact number?
>> Can you share the flags for ParNew + CMS that I can play with it and
>> perform a test?
>>
>> Best,
>> Sergio
>>
>>
>> Il giorno lun 21 ott 2019 alle ore 09:27 Reid Pinchback <
>> rpinchb...@tripadvisor.com> ha scritto:
>>
>>> Since the instance size is < 32gb, hopefully swap isn’t being used, so
>>> it should be moot.
>>>
>>>
>>>
>>> Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably
>>> doesn’t do anything for you.  I believe that only applies to CMS, not
>>> G1GC.  I also wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good
>>> thing on AWS (or anything virtualized), you’d have to run your own tests
>>> and find out.
>>>
>>>
>>>
>>> R
>>>
>>> *From: *Jon Haddad 
>>> *Reply-To: *"user@cassandra.apache.org" 
>>> *Date: *Monday, October 21, 2019 at 12:06 PM
>>> *To: *"user@cassandra.apache.org" 
>>> *Subject: *Re: [EXTERNAL] Re: GC Tuning
>>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>>
>>>
>>>
>>> *Message from External Sender*
>>>
>>> One thing to note, if you're going to use a big heap, cap it at 31GB,
>>> not 32.  Once you go to 32GB, you don't get to use compressed pointers [1],
>>> so you get less addressable space than at 31GB.
>>>
>>>
>>>
>>> [1]
>>> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__blog.codecentric.de_en_2014_02_35gb-2Dheap-2Dless-2D32gb-2Djava-2Djvm-2Dmemory-2Doddities_=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=e9Ahs5XXRBicgUhMZQaboxsqb6jXpjvo48kEojUWaQc=Q7jI4ZEqVMFZIMPoSXTvMebG5fWOUJ6lhDOgWGxiHg8=>
>>>
>>>
>>>
>>> On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R <
>>> sean_r_dur...@homedepot.com> wrote:
>>>
>>> I don’t disagree with Jon, who has all kinds of performance tuning
>>> experience. But for ease of operation, we only use G1GC (on Java 8),
>>> because the tuning of ParNew+CMS requires a high degree of knowledge and
>>> very repeatable testing harnesses. It isn’t worth our time. As a previous
>>> writer mentioned, there is usually better return on our time tuning the
>>> schema (aka helping developers understand Cassandra’s strengths).
>>>
>>>
>>>
>>> We use 16 – 32 GB heaps, nothing smaller than that.
>>>
>>>
>>>
>>> Sean Durity
>>>
>>>
>>>
>>> *From:* Jon Haddad 
>>> *Sent:* Monday, October 21, 2019 10:43 AM
>>> *To:* user@

Re: Cassandra Recommended System Settings

2019-10-21 Thread Sergio
Thanks Elliott!

How do you know if there is too much RAM used for those settings?

Which metrics do you keep track of?

What would you recommend instead?

Best,

Sergio
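
P.S. In the meantime this is what I am planning to look at based on your
notes (kernel 4.16 here, so BBR should be available; treat the values as a
starting point, not a recommendation):

# how much memory TCP is actually using right now
cat /proc/net/sockstat
ss -tm | head

# switch congestion control to BBR
cat >> /etc/sysctl.d/99-cassandra-tcp.conf <<'EOF'
net.core.default_qdisc=fq
net.ipv4.tcp_congestion_control=bbr
EOF
sysctl --system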

On Mon, Oct 21, 2019, 1:41 PM Elliott Sims  wrote:

> Based on my experiences, if you have a new enough kernel I'd strongly
> suggest switching the TCP scheduler algorithm to BBR.  I've found the rest
> tend to be extremely sensitive to even small amounts of packet loss among
> cluster members where BBR holds up well.
>
> High ulimits for basically everything are probably a good idea, although
> "unlimited" may not be purely optimal for all cases.
> The TCP keepalive settings are probably only necessary for traffic
> buggy/misconfigured firewalls, but shouldn't really do any harm on a modern
> fast network.
>
> The TCP memory settings are pretty aggressive and probably result in
> unnecessary RAM usage.
> The net.core.rmem_default/net.core.wmem_default settings are overridden by
> the TCP-specific settings as far as I know, so they're not really
> relevant/helpful for Cassandra
> The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty
> aggressive.  That works out to something like 1Gbps with 130ms latency per
> TCP connection, but on a local LAN with latencies <1ms it's enough buffer
> for over 100Gbps per TCP session.  A much smaller value will probably make
> more sense for most setups.
>
>
> On Mon, Oct 21, 2019 at 10:21 AM Sergio  wrote:
>
>>
>> Hello!
>>
>> This is the kernel that I am using
>> Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
>> x86_64 x86_64 x86_64 GNU/Linux
>>
>> Best,
>>
>> Sergio
>>
>> Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
>> rpinchb...@tripadvisor.com> ha scritto:
>>
>>> I don't know which distro and version you are using, but watch out for
>>> surprises in what vm.swappiness=0 means.  In older kernels it means "only
>>> use swap when desperate".  I believe that newer kernels changed to have 1
>>> mean that, and 0 means to always use the oomkiller.  Neither situation is
>>> strictly good or bad, what matters is what you intend the system behavior
>>> to be in comparison with whatever monitoring/alerting you have put in place.
>>>
>>> R
>>>
>>>
>>> On 10/18/19, 9:04 PM, "Sergio Bilello" 
>>> wrote:
>>>
>>>  Message from External Sender
>>>
>>> Hello everyone!
>>>
>>>
>>>
>>> Do you have any setting that you would change or tweak from the
>>> below list?
>>>
>>>
>>>
>>> sudo cat /proc/4379/limits
>>>
>>> Limit Soft Limit   Hard Limit
>>>  Units
>>>
>>> Max cpu time  unlimitedunlimited
>>> seconds
>>>
>>> Max file size unlimitedunlimited
>>> bytes
>>>
>>> Max data size unlimitedunlimited
>>> bytes
>>>
>>> Max stack sizeunlimitedunlimited
>>> bytes
>>>
>>> Max core file sizeunlimitedunlimited
>>> bytes
>>>
>>> Max resident set  unlimitedunlimited
>>> bytes
>>>
>>> Max processes 3276832768
>>> processes
>>>
>>> Max open files1048576  1048576
>>> files
>>>
>>> Max locked memory unlimitedunlimited
>>> bytes
>>>
>>> Max address space unlimitedunlimited
>>> bytes
>>>
>>> Max file locksunlimitedunlimited
>>> locks
>>>
>>> Max pending signals   unlimitedunlimited
>>> signals
>>>
>>> Max msgqueue size unlimitedunlimited
>>> bytes
>>>
>>> Max nice priority 00
>>>
>>> Max realtime priority 00
>>>
>>> Max realtime timeout  unlimitedunlimited
>>> us
>>>
>>>
>>>
>>> These are the sysctl settings
>>>
>>> default['cassandra']['sysctl'] = {
>>>
>>> 'net.ipv4.tcp_keepalive_time' => 60,
>>>
>>> 'net.ipv4.tcp_keepalive_probes' => 3,
>>>

Re: Cassandra Recommended System Settings

2019-10-21 Thread Sergio
Hello!

This is the kernel that I am using
Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018
x86_64 x86_64 x86_64 GNU/Linux

Best,

Sergio
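
P.S. This is how I am checking the point you raise, before touching anything:

sysctl vm.swappiness   # current value
swapon --show          # empty output means no swap device is configured
free -m                # confirms Swap: 0 on these hosts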

Il giorno lun 21 ott 2019 alle ore 07:30 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> I don't know which distro and version you are using, but watch out for
> surprises in what vm.swappiness=0 means.  In older kernels it means "only
> use swap when desperate".  I believe that newer kernels changed to have 1
> mean that, and 0 means to always use the oomkiller.  Neither situation is
> strictly good or bad, what matters is what you intend the system behavior
> to be in comparison with whatever monitoring/alerting you have put in place.
>
> R
>
>
> On 10/18/19, 9:04 PM, "Sergio Bilello"  wrote:
>
>  Message from External Sender
>
> Hello everyone!
>
>
>
> Do you have any setting that you would change or tweak from the below
> list?
>
>
>
> sudo cat /proc/4379/limits
>
> Limit Soft Limit   Hard Limit
>  Units
>
> Max cpu time  unlimitedunlimited
> seconds
>
> Max file size unlimitedunlimited
> bytes
>
> Max data size unlimitedunlimited
> bytes
>
> Max stack sizeunlimitedunlimited
> bytes
>
> Max core file sizeunlimitedunlimited
> bytes
>
> Max resident set  unlimitedunlimited
> bytes
>
> Max processes 3276832768
> processes
>
> Max open files1048576  1048576
> files
>
> Max locked memory unlimitedunlimited
> bytes
>
> Max address space unlimitedunlimited
> bytes
>
> Max file locksunlimitedunlimited
> locks
>
> Max pending signals   unlimitedunlimited
> signals
>
> Max msgqueue size unlimitedunlimited
> bytes
>
> Max nice priority 00
>
> Max realtime priority 00
>
> Max realtime timeout  unlimitedunlimitedus
>
>
>
> These are the sysctl settings
>
> default['cassandra']['sysctl'] = {
>
> 'net.ipv4.tcp_keepalive_time' => 60,
>
> 'net.ipv4.tcp_keepalive_probes' => 3,
>
> 'net.ipv4.tcp_keepalive_intvl' => 10,
>
> 'net.core.rmem_max' => 16777216,
>
> 'net.core.wmem_max' => 16777216,
>
> 'net.core.rmem_default' => 16777216,
>
> 'net.core.wmem_default' => 16777216,
>
> 'net.core.optmem_max' => 40960,
>
> 'net.ipv4.tcp_rmem' => '4096 87380 16777216',
>
> 'net.ipv4.tcp_wmem' => '4096 65536 16777216',
>
> 'net.ipv4.ip_local_port_range' => '1 65535',
>
> 'net.ipv4.tcp_window_scaling' => 1,
>
>'net.core.netdev_max_backlog' => 2500,
>
>'net.core.somaxconn' => 65000,
>
> 'vm.max_map_count' => 1048575,
>
> 'vm.swappiness' => 0
>
> }
>
>
>
> Am I missing something else?
>
>
>
> Do you have any experience to configure CENTOS 7
>
> for
>
> JAVA HUGE PAGES
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.datastax.com_en_dse_5.1_dse-2Dadmin_datastax-5Fenterprise_config_configRecommendedSettings.html-23CheckJavaHugepagessettings=DwIBaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=zke-WpkD1c6Qt1cz8mJG0ZQ37h8kezqknMSnerQhXuU=b6lGdbtv1SN9opBsIOFRT6IX6BroMW-8Tudk9qEh3bI=
>
>
>
> OPTIMIZE SSD
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.datastax.com_en_dse_5.1_dse-2Dadmin_datastax-5Fenterprise_config_configRecommendedSettings.html-23OptimizeSSDs=DwIBaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=zke-WpkD1c6Qt1cz8mJG0ZQ37h8kezqknMSnerQhXuU=c0S3S3V_0YHVMx2I-pyOh24MiQs1D-L73JytaSw648M=
>
>
>
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__docs.datastax.com_en_dse_5.1_dse-2Dadmin_datastax-5Fenterprise_config_configRecommendedSettings.html=DwIBaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=zke-WpkD1c6Qt1cz8mJG0ZQ37h8kezqknMSnerQhXuU=PZFG6SXF6dL5LRJ-aUoidHnnLGpKPbpxdKstM8M9JMk=
>
>
>
> We are using AWS i3.xlarge instances
>
>
>
> Thanks,
>
>
>
> Sergio
>
>
>
> -
>
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
>
>
>
>


Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Sergio
Thanks, guys!
I just copied and pasted what I found on our test machines, but I can confirm
that we have the same settings in production except for an 8GB heap.
I didn't pick these settings myself and I need to find out why they are
there.
If any of you want to share your flags for a read-heavy workload it would
be appreciated; I would swap them in and test them with tlp-stress.
I am weighing different approaches (G1GC vs ParNew + CMS).
How much RAM do you leave to the OS, as a percentage or as an exact
number?
Can you share ParNew + CMS flags that I can play with and use to run a
test?

Best,
Sergio
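
P.S. This is how I am translating Jon's starting point (quoted below) into a
test configuration on one node; a sketch based on his numbers, not validated
on our workload yet:

# jvm.options: ParNew + CMS with a 16 GB heap and a 10 GB new gen
-Xms16G
-Xmx16G
-Xmn10G
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=6
-XX:MaxTenuringThreshold=2

# cassandra.yaml: cap memtable heap usage at ~2 GB
memtable_heap_space_in_mb: 2048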


Il giorno lun 21 ott 2019 alle ore 09:27 Reid Pinchback <
rpinchb...@tripadvisor.com> ha scritto:

> Since the instance size is < 32gb, hopefully swap isn’t being used, so it
> should be moot.
>
>
>
> Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably
> doesn’t do anything for you.  I believe that only applies to CMS, not
> G1GC.  I also wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good
> thing on AWS (or anything virtualized), you’d have to run your own tests
> and find out.
>
>
>
> R
>
> *From: *Jon Haddad 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Monday, October 21, 2019 at 12:06 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: [EXTERNAL] Re: GC Tuning
> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>
>
> *Message from External Sender*
>
> One thing to note, if you're going to use a big heap, cap it at 31GB, not
> 32.  Once you go to 32GB, you don't get to use compressed pointers [1], so
> you get less addressable space than at 31GB.
>
>
>
> [1]
> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__blog.codecentric.de_en_2014_02_35gb-2Dheap-2Dless-2D32gb-2Djava-2Djvm-2Dmemory-2Doddities_=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=e9Ahs5XXRBicgUhMZQaboxsqb6jXpjvo48kEojUWaQc=Q7jI4ZEqVMFZIMPoSXTvMebG5fWOUJ6lhDOgWGxiHg8=>
>
>
>
> On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R <
> sean_r_dur...@homedepot.com> wrote:
>
> I don’t disagree with Jon, who has all kinds of performance tuning
> experience. But for ease of operation, we only use G1GC (on Java 8),
> because the tuning of ParNew+CMS requires a high degree of knowledge and
> very repeatable testing harnesses. It isn’t worth our time. As a previous
> writer mentioned, there is usually better return on our time tuning the
> schema (aka helping developers understand Cassandra’s strengths).
>
>
>
> We use 16 – 32 GB heaps, nothing smaller than that.
>
>
>
> Sean Durity
>
>
>
> *From:* Jon Haddad 
> *Sent:* Monday, October 21, 2019 10:43 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: GC Tuning
> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
> <https://urldefense.proofpoint.com/v2/url?u=https-3A__thelastpickle.com_blog_2018_04_11_gc-2Dtuning.html=DwMFaQ=9Hv6XPedRSA-5PSECC38X80c1h60_XWA4z1k_R1pROA=OIgB3poYhzp3_A7WgD7iBCnsJaYmspOa2okNpf6uqWc=e9Ahs5XXRBicgUhMZQaboxsqb6jXpjvo48kEojUWaQc=YFRUQ6Rdb5mcFf6GqguRYCsrcAcP6KzjozIgYp56riE=>
>
>
>
> I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
> comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
> it is, but I like to verify first.  The pause times with ParNew + CMS are
> generally lower than G1 when tuned right, but as Chris said it can be
> tricky.  If you aren't willing to spend the time understanding how it works
> and why each setting matters, G1 is a better option.
>
>
>
> I wouldn't run Cassandra in production on less than 8GB of heap - I
> consider it the absolute minimum.  For G1 I'd use 16GB, and never 4GB with
> Cassandra unless you're rarely querying it.
>
>
>
> I typically use the following as a starting point now:
>
>
>
> ParNew + CMS
>
> 16GB heap
>
> 10GB new gen
>
> 2GB memtable cap, otherwise you'll spend a bunch of time copying around
> memtables (cassandra.yaml)
>
> Max tenuring threshold: 2
>
> survivor ratio 6
>
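> As a rough translation of that starting point into jvm.options flags (a
> sketch only; the CMS occupancy settings are the stock Cassandra 3.11
> defaults, not something prescribed above):
>
>   -Xms16G
>   -Xmx16G
>   -Xmn10G
>   -XX:+UseParNewGC
>   -XX:+UseConcMarkSweepGC
>   -XX:+CMSParallelRemarkEnabled
>   -XX:SurvivorRatio=6
>   -XX:MaxTenuringThreshold=2
>   -XX:CMSInitiatingOccupancyFraction=75
>   -XX:+UseCMSInitiatingOccupancyOnly
>
> and, in cassandra.yaml, the ~2GB memtable cap:
>
>   memtable_heap_space_in_mb: 2048
>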
>
>
> I've also done some tests with a 30GB heap, 24 GB of which was new gen.
> This worked surprisingly well in my tests since it essentially keeps
> everything out of the old gen.  New gen allocations are just a pointer bump
> and are pretty fast, so in my (limited) tests of this I was seeing really
> good p99 times.  I was seeing a 200-400 ms pause roughly once a minute
> running a workload that deliberately wasn't hitting a resource limit
> (testing real world looking stress vs overwhelming the cluster).
>
>
>

Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-20 Thread Sergio
Thanks for the answer.

This is the JVM version that I have right now.

openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

These are the current flags. Would you change anything on an i3.xlarge AWS
node?

java -Xloggc:/var/log/cassandra/gc.log
-Dcassandra.max_queued_native_transport_requests=4096 -ea
-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103
-XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB
-XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:+UseG1GC
-XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=45 -XX:G1HeapRegionSize=0
-XX:-ParallelRefProcEnabled -Xms3821M -Xmx3821M
-XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler
-Dcom.sun.management.jmxremote.port=7199
-Dcom.sun.management.jmxremote.rmi.port=7199
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.password.file=/etc/cassandra/conf/jmxremote.password
-Dcom.sun.management.jmxremote.access.file=/etc/cassandra/conf/jmxremote.access
-Djava.library.path=/usr/share/cassandra/lib/sigar-bin
-Djava.rmi.server.hostname=172.24.150.141 -XX:+CMSClassUnloadingEnabled
-javaagent:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar=10100:/etc/cassandra/default.conf/jmx-export.yml
-Dlogback.configurationFile=logback.xml
-Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir=
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
-Dcassandra-foreground=yes -cp
/etc/cassandra/conf:/usr/share/cassandra/lib/airline-0.6.jar:/usr/share/cassandra/lib/antlr-runtime-3.5.2.jar:/usr/share/cassandra/lib/asm-5.0.4.jar:/usr/share/cassandra/lib/caffeine-2.2.6.jar:/usr/share/cassandra/lib/cassandra-driver-core-3.0.1-shaded.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.9.jar:/usr/share/cassandra/lib/commons-lang3-3.1.jar:/usr/share/cassandra/lib/commons-math3-3.2.jar:/usr/share/cassandra/lib/compress-lzf-0.8.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.4.jar:/usr/share/cassandra/lib/concurrent-trees-2.4.0.jar:/usr/share/cassandra/lib/disruptor-3.0.1.jar:/usr/share/cassandra/lib/ecj-4.4.2.jar:/usr/share/cassandra/lib/guava-18.0.jar:/usr/share/cassandra/lib/HdrHistogram-2.1.9.jar:/usr/share/cassandra/lib/high-scale-lib-1.0.6.jar:/usr/share/cassandra/lib/hppc-0.5.4.jar:/usr/share/cassandra/lib/jackson-core-asl-1.9.13.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.9.13.jar:/usr/share/cassandra/lib/jamm-0.3.0.jar:/usr/share/cassandra/lib/javax.inject.jar:/usr/share/cassandra/lib/jbcrypt-0.3m.jar:/usr/share/cassandra/lib/jcl-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/jctools-core-1.2.1.jar:/usr/share/cassandra/lib/jflex-1.6.0.jar:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar:/usr/share/cassandra/lib/jna-4.2.2.jar:/usr/share/cassandra/lib/joda-time-2.4.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jstackjunit-0.0.1.jar:/usr/share/cassandra/lib/libthrift-0.9.2.jar:/usr/share/cassandra/lib/log4j-over-slf4j-1.7.7.jar:/usr/share/cassandra/lib/logback-classic-1.1.3.jar:/usr/share/cassandra/lib/logback-core-1.1.3.jar:/usr/share/cassandra/lib/lz4-1.3.0.jar:/usr/share/cassandra/lib/metrics-core-3.1.5.jar:/usr/share/cassandra/lib/metrics-jvm-3.1.5.jar:/usr/share/cassandra/lib/metrics-logback-3.1.5.jar:/usr/share/cassandra/lib/netty-all-4.0.44.Final.jar:/usr/share/cassandra/lib/ohc-core-0.4.4.jar:/usr/share/cassandra/lib/ohc-core-j8-0.4.4.jar:/usr/share/cassandra/lib/reporter-config3-3.0.3.jar:/usr/share/cassandra/lib/reporter-config-base-3.0.3.jar:/usr/share/cassandra/lib/sigar-1.6.4.jar:/usr/share/cassandra/lib/slf4j-api-1.7.7.jar:/usr/share/cassandra/lib/snakeyaml-1.11.jar:/usr/share/cassandra/lib/snappy-java-1.1.1.7.jar:/usr/share/cassandra/lib/snowball-stemmer-1.3.0.581.1.jar:/usr/share/cassandra/lib/ST4-4.0.8.jar:/usr/share/cassandra/lib/stream-2.5.2.jar:/usr/share/cassandra/lib/thrift-server-0.3.7.jar:/usr/share/cassandra/apache-cassandra-3.11.3.jar:/usr/share/cassandra/apache-cassandra-thrift-3.11.3.jar:/usr/share/cassandra/stress.jar:
org.apache.cassandra.service.CassandraDaemon

Best,

Sergio

On Sat, Oct 19, 2019 at 14:30 Chris Lohfink
wrote:

> "It depends" on your version and heap size but G1 is easier to get right
> so probably wanna stick with that unless you are using small heaps or
> really interested in tuning it (likely for massively smaller gains then
> tuning your data model). There is no GC algo that is strictly better than
> others in all scenarios unfortunately. If your JVM supports it, ZGC or
> Shenandoah are likely going to give you the best latencies.
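>
> For reference, and assuming a JDK 11+ build that actually ships those
> collectors, enabling them is just a flag swap in jvm.options, along the
> lines of:
>
>   -XX:+UnlockExperimentalVMOptions -XX:+UseZGC
>   -XX:+UnlockExperimentalVMOptions -XX:+UseShenandoahGC
>
> (Shenandoah may not need the experimental unlock, depending on the vendor
> build.)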
>
> Chris
>
> On Fri, Oct 18, 2019 at 8:41 PM Sergio Bilello 

Re: Cassandra Repair question

2019-10-19 Thread Sergio
Use Cassandra Reaper.

On Fri, Oct 18, 2019, 10:12 PM Krish Donald  wrote:

> Thanks Manish,
>
> What is the best and fastest way to repair a table using nodetool repair?
> We are using 256 vnodes.
>
>
> On Fri, Oct 18, 2019 at 10:05 PM manish khandelwal <
> manishkhandelwa...@gmail.com> wrote:
>
>> No, it will only cover the primary ranges of the nodes in that single rack.
>> Repair with the -pr option has to be run on all nodes, in a rolling manner.
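>>
>> As a sketch, that rolling pass would look something like this, run on each
>> node one at a time (keyspace name is just a placeholder):
>>
>>   nodetool repair -full -pr my_keyspace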
>>
>> Regards
>> Manish
>>
>> On 19 Oct 2019 10:03, "Krish Donald"  wrote:
>>
>>> Hi Cassandra experts,
>>>
>>>
>>> We are on Cassandra 3.11.1.
>>>
>>> We have to run repairs for a big cluster.
>>>
>>> We have 2 DCs.
>>>
>>> 3 RACs in each DC.
>>>
>>> Replication factor is 3 for each datacenter .
>>>
>>> So if I run repair on all nodes of a single RAC with the "pr" option, then
>>> ideally it will cover all the ranges.
>>>
>>> Please correct my understanding.
>>>
>>>
>>> Thanks
>>>
>>>
>>>


GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-18 Thread Sergio Bilello
Hello!

Is ParNew + CMS still better than G1GC these days?

Any recommendations for i3.xlarge nodes with a read-heavy workload?


Thanks,

Sergio




Cassandra Recommended System Settings

2019-10-18 Thread Sergio Bilello
Hello everyone!

Are there any settings that you would change or tweak in the list below?

sudo cat /proc/4379/limits
Limit Soft Limit   Hard Limit   Units
Max cpu time  unlimitedunlimitedseconds
Max file size unlimitedunlimitedbytes
Max data size unlimitedunlimitedbytes
Max stack sizeunlimitedunlimitedbytes
Max core file sizeunlimitedunlimitedbytes
Max resident set  unlimitedunlimitedbytes
Max processes 3276832768processes
Max open files1048576  1048576  files
Max locked memory unlimitedunlimitedbytes
Max address space unlimitedunlimitedbytes
Max file locksunlimitedunlimitedlocks
Max pending signals   unlimitedunlimitedsignals
Max msgqueue size unlimitedunlimitedbytes
Max nice priority 00
Max realtime priority 00
Max realtime timeout  unlimitedunlimitedus

These are the sysctl settings
default['cassandra']['sysctl'] = {
  'net.ipv4.tcp_keepalive_time' => 60,
  'net.ipv4.tcp_keepalive_probes' => 3,
  'net.ipv4.tcp_keepalive_intvl' => 10,
  'net.core.rmem_max' => 16777216,
  'net.core.wmem_max' => 16777216,
  'net.core.rmem_default' => 16777216,
  'net.core.wmem_default' => 16777216,
  'net.core.optmem_max' => 40960,
  'net.ipv4.tcp_rmem' => '4096 87380 16777216',
  'net.ipv4.tcp_wmem' => '4096 65536 16777216',
  'net.ipv4.ip_local_port_range' => '1 65535',
  'net.ipv4.tcp_window_scaling' => 1,
  'net.core.netdev_max_backlog' => 2500,
  'net.core.somaxconn' => 65000,
  'vm.max_map_count' => 1048575,
  'vm.swappiness' => 0
}
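
For reference, outside of Chef these land in a plain sysctl drop-in; a sketch
(the file name is only an example) would be:

# /etc/sysctl.d/99-cassandra.conf
net.ipv4.tcp_keepalive_time = 60
net.core.rmem_max = 16777216
net.core.somaxconn = 65000
vm.max_map_count = 1048575
vm.swappiness = 0
# ...plus the remaining keys from the hash above

applied with:

sudo sysctl --system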

Am I missing something else?

Do you have any experience configuring CentOS 7
for the following? (A sketch of the hugepages check I have in mind follows the links.)
JAVA HUGE PAGES
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#CheckJavaHugepagessettings

OPTIMIZE SSD
https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#OptimizeSSDs

https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
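
For the hugepages item, the kind of check/change I have in mind (a CentOS 7
sketch, not something I have validated on these exact nodes) is:

cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/defrag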

We are using AWS i3.xlarge instances

Thanks,

Sergio




Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-16 Thread Sergio Bilello
Hello guys!

I performed a thread dump
(https://fastthread.io/my-thread-report.jsp?p=c2hhcmVkLzIwMTkvMTAvMTcvLS1kdW1wLnR4dC0tMC0zMC00MA==)
while trying to join the node with

-Dcassandra.join_ring=false

OR
-Dcassandra.join.ring=false

OR

-Djoin.ring=false

because the node spiked in load and latency was affecting the clients.

With or without that flag the node's latency is high, and I see the load
skyrocketing when the number of established TCP connections increases.

Analyzing /var/log/messages, I can see:

Oct 17 00:23:39 prod-personalization-live-data-cassandra-08 cassandra: INFO 
[Service Thread] 2019-10-17 00:23:39,030 GCInspector.java:284 - G1 Young 
Generation GC in 255ms. G1 Eden Space: 361758720 -> 0; G1 Old Gen: 1855455944 
-> 1781007048; G1 Survivor Space: 39845888 -> 32505856;

Oct 17 00:23:40 prod-personalization-live-data-cassandra-08 cassandra: INFO 
[ScheduledTasks:1] 2019-10-17 00:23:40,352 NoSpamLogger.java:91 - Some 
operations were slow, details available at debug level (debug.log)


Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: TCP: 
request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. Check 
SNMP counters.
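
One way to confirm whether those SYN drops are actually happening (assuming
net-tools is installed on the box) is to watch the listen-queue counters:

netstat -s | grep -i listen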

I don't see anything in debug.log that looks relevant.

The machine is an AWS i3.xlarge with 4 CPUs, 32GB of RAM, and a 1TB SSD.





[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$ nodetool tpstats

Pool Name                        Active  Pending  Completed  Blocked  All time blocked
ReadStage                            32       53     559304        0                 0
MiscStage                             0        0          0        0                 0
CompactionExecutor                    1      107        118        0                 0
MutationStage                         0        0       2695        0                 0
MemtableReclaimMemory                 0        0         11        0                 0
PendingRangeCalculator                0        0         33        0                 0
GossipStage                           0        0       4314        0                 0
SecondaryIndexManagement              0        0          0        0                 0
HintsDispatcher                       0        0          0        0                 0
RequestResponseStage                  0        0     421865        0                 0
Native-Transport-Requests            22        0    1903400        0                 0
ReadRepairStage                       0        0      59078        0                 0
CounterMutationStage                  0        0          0        0                 0
MigrationStage                        0        0          0        0                 0
MemtablePostFlush                     0        0         32        0                 0
PerDiskMemtableFlushWriter_0          0        0         11        0                 0
ValidationExecutor                    0        0          0        0                 0
Sampler                               0        0          0        0                 0
MemtableFlushWriter                   0        0         11        0                 0
InternalResponseStage                 0        0          0        0                 0
ViewMutationStage                     0        0          0        0                 0
AntiEntropyStage                      0        0          0        0                 0
CacheCleanupExecutor                  0        0          0        0                 0

Message type        Dropped
READ                      0
RANGE_SLICE               0
_TRACE                    0
HINT                      0
MUTATION                  0
COUNTER_MUTATION          0
BATCH_STORE               0
BATCH_REMOVE              0
REQUEST_RESPONSE          0
PAGED_RANGE               0
READ_REPAIR               0

[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$





top - 01:44:15 up 2 days,  1:45,  4 users,  load average: 34.45, 27.71, 15.37
Tasks: 140 total,   1 running,  74 sleeping,   0 stopped,   0 zombie
%Cpu(s): 90.0 us,  4.5 sy,  3.0 ni,  1.1 id,  0.0 wa,  0.0 hi,  1.4 si,  0.0 st
KiB Mem : 31391772 total,   250504 free, 10880364 used, 20260904 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 19341960 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
20712 cassand+  20   0  194.1g  14.4g   4.6g S 392.0 48.2  74:50.48 java
20823 sergio.+  20   0  124856   6304   3136 S   1.7  0.0   0:13.51 htop
 7865 root      20   0 1062684  39880  11428 S   0.7  0.1   4:06.02 ir_agent
 3557 consul    20   0   41568  30192  18832 S   0.3  0.1  13:16.37 consul
 7600 root      20   0 2082700  46624  11880 S   0.3  0.1   4:14.60 ir_agent
    1 root      20   0  193660   7740   5220 S   0.0  0.0   0:56.36 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.08 kthreadd
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    7 root      20   0       0      0      0 S   0.0  0.0   0:06.04 ksoftirqd/0

[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$ free
              total       used       free     shared  buff/cache   available
Mem:       31391772   10880916     256732     426552    20254124    19341768
Swap:             0          0          0

[sergio.bilello@prod-personalization-live-data-cassandra-08 ~]$







bash-4.2$ java -jar sjk.jar ttop -p 20712
Monitoring threads ...

2019-10-17T01:45:33.352+ Process summary
process cpu=363.58%
application cpu=261.91% (user=248.65% sys=13.26%)
other: cpu=101.67%
thread count: 474
heap allocation rate 583mb/s
[39]     user=13.56% sys=-0.59% alloc=   11mb/s - OptionalTasks:1
[000379] user= 8.57% sys=-0.27% alloc=   18mb/s - ReadStage-19
[000380] user= 7.85% sys= 0.22% alloc=   19mb/s - Native-Transport-Requests-21
[000295] user= 7.14% sys= 0.23% alloc=   14mb/s - Native-Transport-Requests-5
[000378] user= 7.14% sys=-0.03% alloc=   22mb/s - Native-Transport-Requests-17
[000514] user= 6.42% sys= 0.12% alloc=   20mb/s - Native-Transport-Requests-85
[000293] user= 6.66% sys=-0.32% alloc=   12mb/s - Native-Transport-Requests-2
[000392] user= 6.19% sys= 0.14% alloc= 9545kb/s - Native-Transport-Requests-12
[000492] user= 5.71% sys=-0.24% alloc=   15mb/s - Native-Transport-Requests-24
[000294] user= 5.23% sys=-0.25% alloc=   14mb/s - Native-Transport-Requests-3
[000381] user= 5.47% sys=-0.52% alloc= 7430kb/s - Native-Transport-Requests-23
[000672] user= 4.52% sys= 0.25% alloc=   14mb/s - Native-Transport-Requests-270
[000296] user= 5.23% sys=-0.47% alloc=   13mb/s - ReadStage-7
[000673] user= 4.52% sys= 0.05% alloc=   13mb/s - Native-Transport-Requests-269
[000118] user=

Cassandra node join problem

2019-10-14 Thread Sergio Bilello
Problem:
The Cassandra node does not work even after a restart, throwing this exception:
WARN  [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - 
Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Socket closed
at 
org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:60) 
~[libthrift-0.9.2.jar:0.9.2]
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:134)
 [apache-cassandra-3.11.4.jar:3.11.4]

The CPU Load goes to 50 and it becomes unresponsive.

Node configuration:
OS: Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 
x86_64 x86_64 x86_64 GNU/Linux

This is a working node that does not have the recommended settings, but it is
one of the first nodes in the cluster:
cat /proc/23935/limits
Limit Soft Limit   Hard Limit   Units
Max cpu time  unlimitedunlimitedseconds
Max file size unlimitedunlimitedbytes
Max data size unlimitedunlimitedbytes
Max stack size8388608  unlimitedbytes
Max core file size0unlimitedbytes
Max resident set  unlimitedunlimitedbytes
Max processes 122422   122422   processes
Max open files6553665536files
Max locked memory 6553665536bytes
Max address space unlimitedunlimitedbytes
Max file locksunlimitedunlimitedlocks
Max pending signals   122422   122422   signals
Max msgqueue size 819200   819200   bytes
Max nice priority 00
Max realtime priority 00
Max realtime timeout  unlimitedunlimitedus


I tried to bootstrap a new node to join the existing cluster.
The disk space used is around 400GB out of the 885GB available on the SSD.

On my first attempt, the node failed and got restarted over and over by
systemd, which did not honor the specified limits configuration, and threw:

Caused by: java.nio.file.FileSystemException: 
/mnt/cassandra/data/system_schema/columns-24101c25a2ae3af787c1b40ee1aca33f/md-52-big-Index.db:
 Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) 
~[na:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) 
~[na:1.8.0_161]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) 
~[na:1.8.0_161]
at 
sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177)
 ~[na:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[na:1.8.0_161]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[na:1.8.0_161]
at 
org.apache.cassandra.io.util.SequentialWriter.openChannel(SequentialWriter.java:104)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
.. 20 common frames omitted
^C

I fixed the above by stopping Cassandra, cleaning the commitlog, saved_caches,
hints and data directories, restarting it, and then getting the PID and running
the two commands below, because at the beginning the node didn't even join the
cluster (it was reported as UJ):

sudo prlimit -n1048576 -p 
sudo prlimit -u32768 -p 
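
A more durable fix (assuming Cassandra runs as a systemd unit named
cassandra.service, which I have not double-checked here) would be a drop-in
that sets the limits at the service level:

# /etc/systemd/system/cassandra.service.d/limits.conf
[Service]
LimitNOFILE=1048576
LimitNPROC=32768
LimitMEMLOCK=infinity

followed by:

sudo systemctl daemon-reload
sudo systemctl restart cassandra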

After fixing the max open files problem, the node passed from UpJoining to
UpNormal status. The node joined the cluster, but after a while it started to
throw:

WARN  [Thread-83069] 2019-10-11 16:13:23,713 CustomTThreadPoolServer.java:125 - 
Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: 
Socket closed
at 
org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:60) 
~[libthrift-0.9.2.jar:0.9.2]
at 
org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:113)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:134)
 [apache-cassandra-3.11.4.jar:3.11.4]


I compared