Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-22 Thread Alain Rodriguez via user
Hi Michalis,

It's been a while since I last removed a DC, but I see there is now a
protection to avoid accidentally leaving a DC without auth capability.

This was introduced in C* 4.1 through CASSANDRA-17478 (
https://issues.apache.org/jira/browse/CASSANDRA-17478).

The process of dropping a data center might have been overlooked while
doing this work.

It's never correct for an operator to remove a DC from system_auth
> replication settings while there are currently nodes up in that DC.
>

I believe this assertion is not correct. As Jon and Jeff mentioned, when
removing an entire DC we usually remove the replication *before*
decommissioning any node, for the reasons Jeff explained. The existing
documentation is also clear about this:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
and https://thelastpickle.com/blog/2019/02/26/data-center-switch.html.
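
For reference, a minimal sketch of that order (the DC being removed is called
'DC2' here, the remaining one 'DC1' with RF 3, and 'my_keyspace' is a
placeholder - all names to adapt to your cluster):

- ALTER KEYSPACE my_keyspace WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};  -- repeat for every keyspace still referencing DC2
- nodetool decommission  # then, and only then, on each DC2 node, one at a time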

Michalis, the solution you suggest seems to be the (good/only?) way to go,
even though it looks like a workaround, not really "clean", and something we
need to improve. It was also mentioned here:
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890.
It should work quickly, but only because this keyspace has a fairly low
amount of data; it will still not be optimal nor as fast as it should be (it
should be a near no-op, as explained above by Jeff). It also obliges you to
use the "--force" option, which could lead you to delete one of your nodes in
another DC by mistake (and in a loaded cluster, or a 3-node cluster with
RF = 3, this could hurt...). Having to operate using "nodetool decommission
--force" cannot be standard, but for now I can't think of anything better
for you. Maybe wait for someone else's confirmation, it's been a while
since I operated Cassandra :).
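
To make the workaround order explicit (a sketch only; 'DC1'/'DC2' and the RF
are placeholders, and it assumes clients have moved away and repairs ran, as
in the steps below):

- remove 'DC2' from replication for every keyspace except system_auth (ALTER KEYSPACE as usual)
- nodetool decommission --force  # on each node of DC2, one at a time
- ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};  -- once DC2 is empty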

I think it would make sense to fix this somehow in Cassandra. Maybe we should
ensure that no other keyspace has an RF > 0 for this data center instead
of looking at active nodes, or that no client is connected to the
nodes, add a manual flag somewhere, or something else? Even though I
understand the motivation to protect users against a wrongly distributed
system_auth keyspace, I think this protection should not be kept with this
implementation. If that makes sense, I can create a ticket for this problem.

C*heers,


Alain Rodriguez
casterix.fr


Le lun. 8 avr. 2024 à 16:26, Michalis Kotsiouros (EXT) via user <
user@cassandra.apache.org> a écrit :

> Hello Jon and Jeff,
>
> Thanks a lot for your replies.
>
> I completely get your points.
>
> Some more clarification about my issue.
>
> When trying to update the Replication before the decommission, I get the
> following error message when I remove the replication for system_auth
> keyspace.
>
> ConfigurationException: Following datacenters have active nodes and must
> be present in replication options for keyspace system_auth: [datacenter1]
>
>
>
> This error message does not appear in the rest of the application
> keyspaces.
>
> So, may I change the procedure to:
>
>1. Make sure no clients are still writing to any nodes in the
>datacenter.
>2. Run a full repair with nodetool repair.
>3. Change all keyspaces so they no longer reference the datacenter
>being removed apart from system_auth keyspace.
>4. Run nodetool decommission using the --force option on every node in
>the datacenter being removed.
>5. Change system_auth keyspace so they no longer reference the
>datacenter being removed.
>
> BR
>
> MK
>
>
>
>
>
>
>
> *From:* Jeff Jirsa 
> *Sent:* April 08, 2024 17:19
> *To:* cassandra 
> *Cc:* Michalis Kotsiouros (EXT) 
> *Subject:* Re: Datacenter decommissioning on Cassandra 4.1.4
>
>
>
> To Jon’s point, if you remove from replication after step 1 or step 2
> (probably step 2 if your goal is to be strictly correct), the nodetool
> decommission phase becomes almost a no-op.
>
>
>
> If you use the order below, the last nodes to decommission will cause
> those surviving machines to run out of space (assuming you have more than a
> few nodes to start)
>
>
>
>
>
>
>
> On Apr 8, 2024, at 6:58 AM, Jon Haddad  wrote:
>
>
>
> You shouldn’t decom an entire DC before removing it from replication.
>
>
> —
>
>
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com
> 
>
>
>
>
>
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user <
> user@cassandra.apache.org> wrote:
>
> Hello community,
>
> In our deployments, we usually rebuild the Cassandra datacenters for
> maintenance or recovery operations.
>
> The procedure used since the days of Cassandra 3.x was the one documented
> in datastax documentation. Decommissioning a datacenter | Apache
> Cassandra 3.x (datastax.com)
> 

Re: Decommissioned nodes are in UNREACHABLE state

2020-05-25 Thread Alain RODRIGUEZ
Hello,

Wow. This is a 1 year old issue. Are we still talking about the same node?
Other than what I wrote in my previous message, I'm not sure how to guide
you on that one.

 I am wondering, where the information is coming from.


Me too! :).


I checked system.peers for the IP in UNREACHABLE state and it's not present
>

Have you looked at all the nodes? This system table is *NOT* distributed,
so querying it with 'cqlsh -e "SELECT * FROM system.peers;"' will give
different results on each node. Having a single node that still references
this node is enough for it to show up as UNREACHABLE in the 'nodetool
describecluster' output, I think.
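
If it helps, a small sketch to check every node at once (the IP list is a
placeholder, adapt it to your cluster):

for host in 10.0.0.1 10.0.0.2 10.0.0.3; do
  echo "== $host =="
  cqlsh "$host" -e "SELECT peer, host_id FROM system.peers;"
done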

Random ideas and questions:
- Does the corresponding instance still exist?
- Are we speaking of the same node as last year? That would be the
longest-lasting ghost node I've heard about :).
- Is it defined somewhere in your configuration still (Snitch config file:
cassandra-topology.properties - maybe if using PropertyFileSnitch)?
- What's the C* version you're using? Still 2.1.16?

I somewhat feel you (or I...) might be missing something here: the node has
to be referenced somewhere, otherwise it would disappear on restart.

Good luck with that.

C*heers,
-------
Alain Rodriguez - alain.rodrig...@datastax.com
France / Spain

https://www.datastax.com



Le sam. 23 mai 2020 à 19:07, Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> a écrit :

> any inputs here?
>
> On Sat, May 2, 2020 at 12:49 PM Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
>
>> Hello Alain,
>>
>> Thanks for your suggestions.
>>
>> Surprisingly, the node which is in unreachable state, is not present in
>> any of the system tables. I am wondering, where the information is coming
>> from.
>> I checked system.peers for the IP in UNREACHABLE state and it's not
>> present. I tried restart of Cassandra service as well.
>>
>> On Thu, Jun 20, 2019 at 5:59 AM Alain RODRIGUEZ 
>> wrote:
>>
>>> Hello,
>>>
>>> Assuming your nodes are out for a while and you don't need the data after
>>> 60 days (or cannot get it anyway), the way to fix this is to force the node
>>> out. I would try, in this order:
>>>
>>> - nodetool removenode HOSTID
>>> - nodetool removenode force
>>>
>>> These 2 might really not work at this stage, but if they do, this is a
>>> clean way to do so.
>>> Now, to really push the ghost nodes to the exit door, it often takes:
>>>
>>> - nodetool assassinate
>>>
>>> I think Cassandra 2.1 doesn't have it, you might have to use JMX, more
>>> details here: https://thelastpickle.com/blog/2018/09/18/assassinate.html
>>> ):
>>>
>>> echo "run -b org.apache.cassandra.net:type=Gossiper
>>>> unsafeAssassinateEndpoint $IP_TO_ASSASSINATE"  | java -jar
>>>> jmxterm-1.0.0-uber.jar -l $IP_OF_LIVE_NODE:7199
>>>
>>>
>>> This should really remove the traces of the node, without any safety, no
>>> streaming, no checks, just get rid of it. So to use with a lot of care and
>>> understanding. In your situation I guess this is what will work.
>>>
>>> As a last attempt, you could try removing traces of the dead node(s)
>>> from all the live nodes 'system.peers' table. This table is local to each
>>> node, so the DELETE command is to be send to all the nodes (that have a
>>> trace of an old node).
>>>
>>> - cqlsh -e "DELETE  $IP_TO_REMOVE FROM system.peers;"
>>>
>>> but I see the node IPs in UNREACHABLE state in "nodetool
>>>> describecluster" output. I believe  they appear only for 72 hours, but in
>>>> my case I see those nodes in UNREACHABLE for ever (more than 60 days)
>>>
>>>
>>> To be more accurate, you should never see a leaving node as unreachable I
>>> believe (not even for 72 hours). The 72 hours is the time Gossip should
>>> continue referencing the old nodes. Typically when you remove the ghost
>>> nodes, they should no longer appear in 'nodetool describecluster' at all,
>>> I would say immediately, but still appear in 'nodetool gossipinfo' with a
>>> 'left' or 'remove' status.
>>>
>>> I hope that helps and that one of the above will do the trick (I'd bet
>>> on the assassinate :)). Also sorry it took us a while to answer you this
>>> relatively common question :);
>>>
>>> C*heers,
>>> ---
>>> Alain Rodriguez - al...@thelastpickle.com
>>> France / Spain
>>>
>>> The Last Pickle - Apache Cassandra Consulting
>>> http

Re: Elevated response times from all nodes in a data center at the same time.

2019-10-16 Thread Alain RODRIGUEZ
This brings something else to my mind. The fact that latency goes down when
there is a traffic increase can simply mean that each of the queries sent
during the spike is really efficient, and even though you might still have
some other queries being slow (even during peak hours), having that many
'efficient/quick' requests lowers the average/pXX latencies. Does the
max latency change over time?

You can here try to get a sense of this with:

- nodetool proxyhistograms
- nodetool tablehistograms   # For the table level stats
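
If you want to watch how these evolve over time, a quick-and-dirty sketch
(keyspace and table names are placeholders):

while true; do date; nodetool proxyhistograms; nodetool tablehistograms my_keyspace my_table; sleep 60; done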

We checked the system.log files on our nodes, took a thread dump and
> checked for any rogue processes running on the nodes which is stealing CPU
> but we are able to find nothing.


From what I read/understand, resources are fine, so I would put these
searches aside for now. About the log file, I like to use:

- tail -fn 100 /var/log/cassandra/system.log #See current logs (if you are
having the issues NOW)
- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what
happened and was wrong

For now I can't think of anything else; I hope some of those ideas will
help you diagnose the problem. Once it is diagnosed, we should be able to
reason about how we can fix it.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 15 oct. 2019 à 17:26, Reid Pinchback  a
écrit :

> I’d look to see if you have compactions fronting the p99’s.  If so, then
> go back to looking at the I/O.  Disbelieve any metrics not captured at a
> high resolution for a time window around the compactions, like 100ms.  You
> could be hitting I/O stalls where reads are blocked by the flushing of
> writes.  It’s short-lived when it happens, and per-minute metrics won’t
> provide breadcrumbs.
>
>
>
> *From: *Bill Walters 
> *Date: *Monday, October 14, 2019 at 7:10 PM
> *To: *
> *Subject: *Elevated response times from all nodes in a data center at the
> same time.
>
>
>
> Hi Everyone,
>
>
>
> Need some suggestions regarding a peculiar issue we started facing in our
> production cluster for the last couple of days.
>
>
>
> Here are our Production environment details.
>
>
>
> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in
> each region.
>
> No of Nodes: 24
>
> Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs
> and 2 OLAP Data centers for Analytics and Batch loads)
>
> Instance Types: r5.8x Large
>
> Average Node Size: 182 GB
>
> Work Load: Read heavy
>
> Read TPS: 22k
>
> Cassandra version: 3.0.15
>
> Java Version: JDK 181.
>
> EBS Volumes: GP2 with 1TB 3000 iops.
>
>
>
> 1. We have been running in production for more than one year and our
> experience with Cassandra is great. Experienced little hiccups here and
> there but nothing severe.
>
>
>
> 2. But recently for the past couple of days we see a behavior where our
> p99 latency in our AWS us-east-1 region OLTP data center, suddenly starts
> rising from 2 ms to 200 ms. It starts with one node where we see the 99th
> percentile Read Request latency in Datastax Opscenter starts increasing.
> And it spreads immediately, to all other 6 nodes in the data center.
>
>
>
> 3. We do not see any Read request timeouts or Exceptions in our API
> Splunk logs only p99 and average latency go up suddenly.
>
>
>
> 4. We have investigated CPU level usage, Disk I/O, Memory usage and
> Network parameters for the nodes during this period and we are not
> experiencing any sudden surge in these parameters.
>
>
>
> 5. We setup client using WhiteListPolicy to send queries to each of the 6
> nodes to understand which one is bad, but we see all of them responding
> with very high latency. It doesn't happen during our peak traffic period
> sometime in the night.
>
>
>
> 6. We checked the system.log files on our nodes, took a thread dump and
> checked for any rogue processes running on the nodes which is stealing CPU
> but we are able to find nothing.
>
>
>
> 7. We even checked our the write requests coming in during this time and
> we do not see any large batch operations happening.
>
>
>
> 8. Initially we tried restarting the nodes to see if the issue can be
> mitigated but it kept happening, and we had to fail over API traffic to
> us-west-2 region OLTP data center. After a couple of hours we failed back
> and everything seems to be working.
>
>
>
> We are baffled by this behavior, only correlation we find is the "Native
> requests pending" in our Task queues when this happens.
>
>
>
> Please let us know your suggestions on how to debug this issue. Has anyone
> experienced an issue like this before.(We had issues where one node starts
> acting bad due to bad EBS volume I/O read and write time, but all nodes
> experiencing an issue at same time is very peculiar)
>
>
>
> Thank You,
>
> Bill Walters.
>


Re: cluster rolling restart

2019-10-16 Thread Alain RODRIGUEZ
Hello Marco,

No, this should not be a 'normal' / 'routine' thing in a Cassandra cluster.
I can imagine it being helpful in some cases or versions of Cassandra if
there are memory issues/leaks or something like that going wrong, but
'normally' you should not have to do that. What's more, when doing so,
you'll have to 'warm up' C* again to get back to full speed (loading the
page/key caches again, for example), and other things, like restarting
compactions that were interrupted, might even slow down the cluster somewhat.

Let's just say it should not harm (much), and it might even be useful in
some corner cases, but restarting the cluster regularly, for no reason, is
definitely not part of 'best practices' around Cassandra imho.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mer. 16 oct. 2019 à 09:56, Marco Gasparini <
marco.gaspar...@competitoor.com> a écrit :

> hi all,
>
> I was wondering if it is recommended to perform a rolling restart of the
> cluster once in a while.
> Is it a good practice or necessary? how often?
>
> Thanks
> Marco
>


Re: Constant blocking read repair for such a tiny table

2019-10-15 Thread Alain RODRIGUEZ
Hello Patrick,

Still in trouble with this? I must admit I'm really puzzled by your issue.
I have no real idea of what's going on. Would you share with us the output
of:

- nodetool status 
- nodetool describecluster
- nodetool gossipinfo
- nodetool tpstats

Also, you said the app has been running for a long time, with no changes.
What about Cassandra? Any recent operations?

I hope that with this information we might be able to understand better and
finally be able to help.

---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 4 oct. 2019 à 00:25, Patrick Lee  a
écrit :

> this table was actually leveled compaction before, just changed it to size
> tiered yesterday while researching this.
>
> On Thu, Oct 3, 2019 at 4:31 PM Patrick Lee 
> wrote:
>
>> its not really time series data.   and it's not updated very often, it
>> would have some updates but pretty infrequent. this thing should be super
>> fast, on avg it's like 1 to 2ms p99 currently but if they double - triple
>> the traffic on that table latencies go upward to 20ms to 50ms.. the only
>> odd thing i see is just that there are constant read repairs that follow
>> the same traffic pattern on the reads, which shows constant writes on the
>> table (from the read repairs), which after read repair or just normal full
>> repairs (all full through reaper, never ran any incremental repair) i would
>> expect it to not have any mismatches.  the other 5 tables they use on the
>> cluster can have the same level traffic all very simple select from table
>> by partition key which returns a single record
>>
>> On Thu, Oct 3, 2019 at 4:21 PM John Belliveau 
>> wrote:
>>
>>> Hi Patrick,
>>>
>>>
>>>
>>> Is this time series data? If so, I have run into issues with repair on
>>> time series data using the SizeTieredCompactionStrategy. I have had
>>> better luck using the TimeWindowCompactionStrategy.
>>>
>>>
>>>
>>> John
>>>
>>>
>>>
>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
>>>
>>>
>>>
>>> *From: *Patrick Lee 
>>> *Sent: *Thursday, October 3, 2019 5:14 PM
>>> *To: *user@cassandra.apache.org
>>> *Subject: *Constant blocking read repair for such a tiny table
>>>
>>>
>>>
>>> I have a cluster that is running 3.11.4 ( was upgraded a while back from
>>> 2.1.16 ).  what I see is a steady rate of read repair which is about 10%
>>> constantly on only this 1 table.  Repairs have been run (actually several
>>> times).  The table does not have a lot of writes to it so after repair, or
>>> even after a read repair I would expect it to be fine.  the reason i'm
>>> having to dig into this so much is for the fact that under a much large
>>> traffic load than their normal traffic, latencies are higher than the app
>>> team wants
>>>
>>>
>>>
>>> I mean this thing is tiny, it's a 12x12 cluster but this 1 table is like
>>> 1GB per node on disk.
>>>
>>>
>>>
>>> the application team is doing reads at LOCAL_QUORUM and I can simulate
>>> this on that cluster by running a query using quorum and/or local_quorum
>>> and in the trace can see every time running the query it comes back with a
>>> DigestMismatchException no matter how many times I run it. that record
>>> hasn't been updated by the application for several months.
>>>
>>>
>>>
>>> repairs are scheduled and run every 7 days via reaper, recently in the
>>> past week this table has been repaired at least 3 times.  every time there
>>> are mismatches and data streams back and forth but yet still a constant
>>> rate of read repairs.
>>>
>>>
>>>
>>> curious if anyone has any recommendations to look info further or have
>>> experienced anything like this?
>>>
>>>
>>>
>>> this node has been up for 24 hours.. this is the netstats for read
>>> repairs
>>>
>>> Mode: NORMAL
>>> Not sending any streams.
>>> Read Repair Statistics:
>>> Attempted: 7481
>>> Mismatch (Blocking): 11425375
>>> Mismatch (Background): 17
>>> Pool NameActive   Pending  Completed   Dropped
>>> Large messages  n/a 0   1232 0
>>> Small messages  n/a 0  3959036

Re: Assassinate fails

2019-08-29 Thread Alain RODRIGUEZ
Hello Alex,

long time  - I had to wait for a quiet week to try this. I finally did, I
> thought I'd give you some feedback.


Thanks for taking the time to share this, I guess it might be useful to
some other people around to know the end of the story ;-).

Glad this worked for you,

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 16 août 2019 à 08:16, Alex  a écrit :

> Hello Alain,
>
> long time  - I had to wait for a quiet week to try this. I finally did, I
> thought I'd give you some feedback.
>
> Short reminder: one of the nodes of my 3.9 cluster died and I replaced it.
> But it still appeared in nodetool status, on one node with a "null" host_id
> and on another with the same host_id of its replacement. nodetool
> assassinate failed and I could not decommission or remove any other node on
> the cluster.
>
> Basically, after backup and preparing another cluster in case anything
> went wrong, I did :
>
> DELETE FROM system.peers WHERE peer = '192.168.1.18';
>
> and restarted cassandra on the two nodes still seeing the zombie node.
>
> After the first restart, the cassandra system.log was filled with:
>
> java.lang.NullPointerException: null
> WARN  [MutationStage-2] 2019-08-15 15:31:44,735
> AbstractLocalAwareExecutorService.java:169 - Uncaught exception on thread
> Thread[MutationStage-2,5,main]:
>
> So... I restarted again. The error disappeared. I ran a full repair and
> everything seems to be back in order. I could decommission a node without
> problem.
>
> Thanks for your help !
>
> Alex
>
>
>
>
> Le 05.04.2019 10:55, Alain RODRIGUEZ a écrit :
>
> Alex,
>
>
>> Well, I tried : rolling restart did not work its magic.
>
>
> Sorry to hear that, and sorry for misleading you. My faith in the rolling
> restart's magical power went down a bit, but I still think it was worth a try :D.
>
>> @ Alain : In system.peers I see both the dead node and its replacement
>> with the same ID :
>>peer | host_id
>>   --+--
>>192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>>192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>>
>> Is it expected ?
>>
>> If I cannot fix this, I think I will add new nodes and remove, one by
>> one, the nodes that show the dead node in nodetool status.
>>
> Well, no. This is clearly not good or expected I would say.
>
> *tl;dr - Suggested fix:*
> What I would try, to fix this, is removing this row. It *should* be safe,
> but that's only my opinion, and on the condition that you remove *only* the
> 'ghost/dead' nodes. Any mistake here would probably be costly. Again, be
> aware you're touching a sensitive part when messing with system tables.
> Think twice, check twice, take a copy of the SSTables/a snapshot. Then I
> would go for it and observe changes on one node first. If no harm is done,
> continue to the next node.
>
> Considering the old node is '192.168.1.18', I would run this on all nodes
> (maybe after testing on a node) to make it simple or just run it on nodes
> that show the ghost node(s):
>
> *"DELETE FROM system.peers WHERE peer = '192.168.1.18';"*
>
> Maybe you will need to restart, but I think you won't even need it. I have
> good hope that this should finally fix your issue with no harm.
>
> *More context - Idea of the problem:*
> This, above, is clearly an issue I would say. Most probably the source of
> your troubles here. The problem is that I lack understanding: from where I
> stand, this kind of bug should not happen anymore in Cassandra (I did not
> see anything similar for a while).
>
> I would blame:
> - A corner case scenario (unlikely, system tables have been rather solid for a
> while). Or maybe you are using an old C* version. It *might* be related to
> this (or similar): https://issues.apache.org/jira/browse/CASSANDRA-7122)
> - A really weird operation (a succession of actions might have put you in
> this state, but it's hard for me to say what)
> - KairosDB? I don't know it or what it does. Might it be less reliable
> than Cassandra is, and have led to this issue? Maybe, I have no clue once
> again.
>
> *Risk of this operation and current situation:*
> Also, I *think* the current situation is relatively 'stable' (maybe just
> some hints being stored for nothing, and possibly not being able to add
> more nodes or change the schema?). This is the kind of situation where
> 'rushing' a solution without understanding the impacts and risks can make
> things go terribly wrong. Take the time to analyse my suggested fix,
>

Re: Difficulties after nodetool removenode

2019-07-04 Thread Alain RODRIGUEZ
Hello,

Just for one node, and if you have strong consistency ('Read CL + Write CL > RF'), you can:

- force the node out with 'nodetool removenode force' if it's still around
- run a repair (just on that node, but full repair).
OR
- force the node out with 'nodetool removenode force' if it's still around
- wipe this node and replace it by itself (if you are missing a lot of data
or are not comfortable with repairs). *If you just lost a node, this might
not be safe.* Repair is safer if run with the right options/tool.
OR
- If the node is still there you can also re-run the 'nodetool removenode'.
Data will be streamed again (to all nodes) and compacted in the future
eventually.
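
As commands, the first option would look something like this (a sketch; run
the repair on the node that was restarted):

- nodetool removenode force  # completes the pending removenode without waiting for the streams
- nodetool repair -full  # full (not incremental) repair, on that node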

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le sam. 29 juin 2019 à 14:36, Morten A. Iversen 
a écrit :

> Hi,
>
>
> We had a hardware issue with one node in a Cassandra cluster and had to
> use the "nodetool removenode UUID" command from a different node. This
> seems to be running fine, but one node was restarted after the "nodetool
> removenode" command was run, and now it seems all streams going from that
> node have stopped.
>
>
> On most nodes I can see both "Receiving X files, Y bytes total. Already
> received Z files, Q bytes total" and "Sending X files, Y bytes total.
> Already sent Z files, Q bytes total" messages when running nodetool
> netstats.
>
>
> Nodes are starting to complete this process, but for the node that was
> restarted after the "nodetool removenode" command I can only see the
> "receiving" messages, and on the other nodes the progress from that node
> seems to have stopped. Is there some way to restart the process on only the
> node that was restarted?
>
>
> Regards
>
> Morten Iversen
>
>


Re: Adding new DC with different version of Cassandra

2019-07-04 Thread Alain RODRIGUEZ
I agree with Jeff here: it's not recommended to do that, but it should still
be fine :).

Something that might be slightly safer (even though 3.11.0 is buggy, as
mentioned above...) could be to add the new DC on 3.11.0 as well: do the
streaming with 3.11.0, upgrade the new DC only, switch clients over, then
terminate the old DC.

I talked about it a bit here:
https://thelastpickle.com/blog/2019/02/26/data-center-switch.html#use-cases.
There is also other information you might find useful, as this post is a
runbook that details the actions for a data center switch. It looks like a
good fit :).
good fit :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 1 juil. 2019 à 21:38, Rahul Reddy  a
écrit :

> Thanks Jeff,
>
> We want to migrate to Apache 3.11.3 once entire cluster in apache we
> eventually decommission datastax DC
>
> On Mon, Jul 1, 2019, 9:31 AM Jeff Jirsa  wrote:
>
>> Should be fine, but you probably want to upgrade anyway, there were a few
>> really important bugs fixed since 3.11.0
>>
>> > On Jul 1, 2019, at 3:25 AM, Rahul Reddy 
>> wrote:
>> >
>> > Hello All,
>> >
>> > We have datastax Cassandra cluster which uses 3.11.0 and we want to add
>> new DC with apache Cassandra 3.11.3. we tried doing the same and data got
>> streamed to new DC. Since we are able to stream data any other issues we
>> need to consider. Is it because of same type of sstables used in both the
>> cases it let me add new DC?
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>


Re: Restore from EBS onto different cluster

2019-07-04 Thread Alain RODRIGUEZ
Hello

I would not use the SSTable loader if there are big data sets... It's too
slow and it somewhat makes things more complex, I think.
I love EBS for restoring nodes; things are even much easier than with
instance stores, and an incredibly efficient restore is possible.

Also when I do restore do I need to think about the token ranges of the old
> and new cluster's mapping?
>

In case you have doubts about the procedure (this keeps the cluster1 name, as
in solutions 1, 2 and 3 below; option 4 is somewhat different), you can check
this out:
https://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html#the-copypaste-approach-on-aws-ec2-with-ebs-volumes
.

Now I want to restore few tables/keyspaces from the snapshot volumes, so I
> created another cluster say cluster2 and attached the snapshot volumes on
> to the new cluster's ec2 nodes
>

For cluster2, what's annoying you is the 'system.peers' and 'system.local'
tables (the latter holds the name of the cluster). The information about the
cluster name and the other nodes lives there.
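
A quick way to see what a node has stored there (a sketch, to run against the
restored nodes):

- cqlsh -e "SELECT cluster_name FROM system.local;"  # the name this node believes it belongs to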

Cluster2 is not starting bcz the system keyspace in the snapshot taken was
> having cluster name as cluster1 and the cluster on which it is being
> restored is cluster2. How do I do a restore in this case?
>

Solutions I can think of right now:
1 - Use a dedicated private network (VPC), not impacting prod for sure,
then go for cluster1 name for this cluster and apply the standard method to
attach the EBS (ensure you map tokens, DC, Racks, etc) correctly and start
cassandra. Just attach the disk, start cassandra and check the data is
there :).
2 - As mentioned above, you can keep the 'cluster1' name and prevent new nodes
from talking to old nodes via security groups as well (but make no mistake...
less safe than a new VPC I'd say)
3 - Same idea, isolate groups of machines with a firewall (IPTABLES, ufw,
...)

If you really need/prefer to use 'cluster2' name:
4 - (Advanced solution, might request more work):
- Remove the 2 system tables mentioned above. You can pick what to do with
'system_auth' as well...
- Create a new cluster (instances, configuration management, etc),
configured with the same tokens as the original cluster.
- attach the EBS you created to the RIGHT new node (tokens - DCs, racks might
have to mirror the original ones, not too sure here) and use the cluster
name you like.
- Start seed nodes (with explicit tokens = old tokens)
- Start remaining nodes.
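
For the 'explicit tokens' part of option 4, a possible sketch (assuming
vnodes; to be run on the old node, then pinned on the matching new node):

- nodetool info -T | grep '^Token' | awk '{print $3}' | paste -sd, -  # prints a comma-separated token list
- put that list into cassandra.yaml as 'initial_token: ...' on the matching new node before its first start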

I'm answering without support and might be missing a step or 2, but I have
been doing this successfully already and these are the distinct solutions I
used :).

Hope any of these helps,
C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le ven. 28 juin 2019 à 09:02, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> a écrit :

> On Fri, Jun 28, 2019 at 8:37 AM Ayub M  wrote:
>
>> Hello, I have a cluster with 3 nodes - say cluster1 on AWS EC2 instances.
>> The cluster is up and running, took snapshot of the keyspaces volume.
>>
>> Now I want to restore few tables/keyspaces from the snapshot volumes, so
>> I created another cluster say cluster2 and attached the snapshot volumes on
>> to the new cluster's ec2 nodes. Cluster2 is not starting bcz the system
>> keyspace in the snapshot taken was having cluster name as cluster1 and the
>> cluster on which it is being restored is cluster2. How do I do a restore in
>> this case? I do not want to do any modifications to the existing cluster.
>>
>
> Hi,
>
> I would try to use the same cluster name just to be able to restore it and
> ensure that nodes of cluster2 cannot talk to cluster1 by the means of
> setting up Security Groups, for example.
>
> Also when I do restore do I need to think about the token ranges of the
>> old and new cluster's mapping?
>>
>
> Absolutely.  For a successful restore you must ensure that you restore a
> snapshot from a rack (Availability Zone, if you're using the EC2Snitch) into
> the same rack in the new cluster.
>
> Regards,
> --
> Alex
>
>


Re: Get information about GC pause (Stop the world) via JMX, it's possible ?

2019-07-04 Thread Alain RODRIGUEZ
Hello,

The metrics above are a must-have and are available through JMX. That's what
you need, I guess.
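
For a quick look without extra tooling, a sketch using jmxterm (the jar name,
the ParNew bean and the local JMX port 7199 are assumptions, adapt them to
your GC and setup):

echo "get -b java.lang:type=GarbageCollector,name=ParNew CollectionCount CollectionTime" | java -jar jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199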

To add to this, since it seems to be a topic for you at the moment: for more
digging, I personally love this tool that feeds on 'gc.log' files:
http://gceasy.io/. It really helped me a few times in the past to get a
quick yet quite in-depth understanding of what was going on in the JVM and
with GC. But it's getting more and more closed (the set of information
available for free is shrinking).

Also you can analyse dumps with MAT (eclipse) in the worst case (it's not
as easy as... 'gceasy' ¯\_(ツ)_/¯ ).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le jeu. 27 juin 2019 à 21:52, Avinash Mandava  a
écrit :

> Here's the metrics you want. Depends on what GC you're using as Dimo said
> above.
>
> *1) If you're using CMS - Collection time / Collection count (Avg time per
> collection)*
>
> *ParNew*
> (java.lang.type=GarbageCollector.name=ParNew.CollectionTime /
> java.lang.type=GarbageCollector.name=ParNew.CollectionCount)
>
>
> *CMS*
> (java.lang.type=GarbageCollector.name=ConcurrentMarkSweep.CollectionTime /
> java.lang.type=GarbageCollector.name=ConcurrentMarkSweep.CollectionCount)
>
> *2) If you're using G1GC - Collection time / Collection count (Avg time
> per collection)*
>
> *Young Gen*
> (java.lang.type=GarbageCollector.name=G1 Young Generation.CollectionTime
>  /
> java.lang.type=GarbageCollector.name=G1 Young Generation.CollectionCount)
>
> *Old Gen*
> (java.lang.type=GarbageCollector.name=G1 Old Generation.CollectionTime /
> java.lang.type=GarbageCollector.name=G1 Old Generation.CollectionCount)
>
> On Thu, Jun 27, 2019 at 11:56 AM Dimo Velev  wrote:
>
>> That is a standard JVM metric. Connect to your Cassandra node with a JMX
>> browser (jconsole, jmc, ...) and browse the metrics. Depending on the
>> garbage collector you use, they will be different but are there
>>
>> On Thu, 27 Jun 2019, 13:47 Ahmed Eljami,  wrote:
>>
>>> Hi,
>>>
>>> I want to know if it's possible to get information about GC pause
>>> duration (Stop the world) via JMX.
>>>
>>> Today, we get this information from gc.log with the JVM option
>>> -XX:+PrintGCApplicationStoppedTime
>>>
>>> Total time for which application threads were stopped: 0.0001273
>>> seconds, Stopping threads took: 0.196 seconds
>>>
>>> A python script is deployed on every node to parse the gc.log
>>>
>>>
>>> and it works quite well, but in a Kubernetes environment we will have
>>> to create a sidecar inside a pod, which we try to avoid.
>>>
>>> Thanks
>>>
>>
>
> --
> www.vorstella.com
> 408 691 8402
>


Re: Write count vs Local Write Count

2019-07-04 Thread Alain RODRIGUEZ
Hello,

I tried to get the exact understanding of it. Rather than checking the code
(which I invite you to do alternatively), I played around with CCM:

$ tail -n 100 *

==> schema_collection.cql <==

DROP KEYSPACE IF EXISTS tlp_labs;


CREATE KEYSPACE tlp_labs

  WITH REPLICATION = {

   'class' : 'SimpleStrategy',

   'replication_factor' : 2

  };



CREATE TYPE tlp_labs.item (

  csn bigint,

  name text

);



CREATE TABLE tlp_labs.products (

  product_id bigint PRIMARY KEY,

  items map<bigint, frozen<item>>

);


CREATE TABLE tlp_labs.services (

  service_id bigint PRIMARY KEY,

  items map<bigint, frozen<item>>

);


==> update_collections.cql <==

UPDATE tlp_labs.products USING TTL 1000 SET items = items + {10: {csn: 100,
name: 'item100'}} WHERE product_id = 1;

UPDATE tlp_labs.products USING TTL 2000 SET items = items + {20: {csn: 200,
name: 'item200'}} WHERE product_id = 1;

UPDATE tlp_labs.products USING TTL 3000 SET items = items + {30: {csn: 300,
name: 'item300'}} WHERE product_id = 2;

UPDATE tlp_labs.services USING TTL 4000 SET items = items + {40: {csn: 400,
name: 'item400'}} WHERE service_id = 4;

UPDATE tlp_labs.services USING TTL 4000 SET items = items + {40: {csn: 400,
name: 'item400'}} WHERE service_id = 3;

UPDATE tlp_labs.services USING TTL 4000 SET items = items + {40: {csn: 400,
name: 'item400'}} WHERE service_id = 5;

The tool presents values like this (on 3 distinct nodes):


$ for i in $(seq 1 3); do echo "Node $i"; ccm node$i nodetool tablestats
tlp_labs | grep -i -e Table: -e Keyspace  -e 'write count'; done

Node 1

Keyspace : tlp_labs

Write Count: 4

Table: products

Local write count: 2

Table: services

Local write count: 2

Node 2

Keyspace : tlp_labs

Write Count: 5

Table: products

Local write count: 3

Table: services

Local write count: 2

Node 3

Keyspace : tlp_labs

Write Count: 3

Table: products

Local write count: 1

Table: services

Local write count: 2




The local write count is at the table level and shows writes that really hit
that node. If at the client/coordinator level there are 6 queries and the
RF is 2 like here, you can expect a total 'Write Count' at the keyspace
level of 12.


Can any one tell me the difference b/w Write Count vs Local write count
> from node tool tablestats output ?


To answer your question: it seems that the 'Write Count' is a sum of the
'Local write count' of all the tables in the keyspace.
This Write Count, summed across all nodes, should be RF times bigger than
the number of queries sent. Does that match what you're seeing?

It was a bit confusing for me at first as I was somewhat expecting this to
be the coordinator/client request counts, mostly because in metrics the
'Local' prefix is always used in the name of 'local' counts, and the
coordinator metrics are called 'Write Count' without the local prefix :).
But that's not the case here. 'nodetool tablestats' **only** shows **local**
writes, and 'Write Count' at the keyspace level is just a sum of all the
per-table metrics detailed below as 'local write count'.

This makes more sense to me now, because we are unable to get metrics at
the coordinator level per table or keyspace. It's always per host for
client request counts metrics.

I hope I did not make it more confusing, let me know if I'm unclear or if
it doesn't suit your own observations. Thanks for the questions, I could
also rework my understanding here :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 27 juin 2019 à 21:51, raja k  a écrit :

> Hello,
>
> Can any one tell me the difference b/w Write Count vs Local write count
> from node tool tablestats output ?
>
> Below is what I see for one of my table
>
>  Write Count: 248214002
>  Write Latency: 0.07470789510093795 ms.
>   Local write count: 1183420
>   Local write latency: NaN ms
>
> Thanks,
>
>


Re: Decommissioned nodes are in UNREACHABLE state

2019-06-20 Thread Alain RODRIGUEZ
Hello,

Assuming your nodes are out for a while and you don't need the data after 60
days (or cannot get it anyway), the way to fix this is to force the node
out. I would try, in this order:

- nodetool removenode HOSTID
- nodetool removenode force

These 2 might really not work at this stage, but if they do, this is a
clean way to do so.
Now, to really push the ghost nodes to the exit door, it often takes:

- nodetool assassinate

I think Cassandra 2.1 doesn't have it; you might have to use JMX (more
details here: https://thelastpickle.com/blog/2018/09/18/assassinate.html):

echo "run -b org.apache.cassandra.net:type=Gossiper
> unsafeAssassinateEndpoint $IP_TO_ASSASSINATE"  | java -jar
> jmxterm-1.0.0-uber.jar -l $IP_OF_LIVE_NODE:7199


This should really remove the traces of the node, without any safety, no
streaming, no checks, just get rid of it. So to use with a lot of care and
understanding. In your situation I guess this is what will work.

As a last attempt, you could try removing traces of the dead node(s) from
all the live nodes' 'system.peers' table. This table is local to each node,
so the DELETE command has to be sent to all the nodes (that have a trace of
the old node).

- cqlsh -e "DELETE  $IP_TO_REMOVE FROM system.peers;"

but I see the node IPs in UNREACHABLE state in "nodetool describecluster"
> output. I believe  they appear only for 72 hours, but in my case I see
> those nodes in UNREACHABLE for ever (more than 60 days)


To be more accurate, you should never see a leaving node as unreachable, I
believe (not even for 72 hours). The 72 hours is the time Gossip should
continue referencing the old nodes. Typically, when you remove the ghost
nodes, they should no longer appear in 'nodetool describecluster' at all,
I would say immediately, but they will still appear in 'nodetool gossipinfo'
with a 'left' or 'remove' status.

I hope that helps and that one of the above will do the trick (I'd bet on
the assassinate :)). Also, sorry it took us a while to answer this
relatively common question :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 13 juin 2019 à 00:55, Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> a écrit :

> Hello,
>
> I have a Cassandra cluster running with 2.1.16 version of Cassandra, where
> I have decommissioned few nodes from the cluster using "nodetool
> decommission", but I see the node IPs in UNREACHABLE state in "nodetool
> describecluster" output. I believe  they appear only for 72 hours, but in
> my case I see those nodes in UNREACHABLE for ever (more than 60 days).
> Rolling restart of the nodes didn't remove them. any idea what could be
> causing here?
>
> Note: I don't see them in the nodetool status output.
>


Re: JanusGraph and Cassandra

2019-06-20 Thread Alain RODRIGUEZ
Hello,

This looks more like a JanusGraph question, so I would rather try the
support channels for that tool instead.
I never saw anyone here or elsewhere using JanusGraph; after searching, I
only found 4 threads about it here. Thus I think even people knowing
Cassandra monitoring/metrics very well will not be able to help you
here.

If you have questions around the metrics as such, we might be able to help
you though :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mer. 12 juin 2019 à 08:31, Vinayak Bali 
a écrit :

> Hi,
>
> I am using JanusGraph along with Cassandra as the backend. I have created
> a graph using JanusGraphFactory.open() method. I want to create one more
> graph and traverse it. Can you please help me.
>
> Regards,
> Vinayak
>
>


Re: Cassandra Tombstone

2019-06-20 Thread Alain RODRIGUEZ
Hello Aneesh,

Reading your message and the answers given, I really think this post I wrote
about 3 years ago now (how quickly time goes by...) about tombstones might be
of interest to you:
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html.
Your problem is not related to tombstones I'd say, but the first part of the
post explains how consistency works in Cassandra, and only then takes on the
case of deletes/tombstones. This first part might help you solve your current
issue:
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#cassandra-some-availability-and-consistency-considerations.
I really tried to reason from scratch, even for people who are new to
Cassandra with a very basic knowledge of internals.

For example, you'd still read things like:

CL.READ  = Consistency Level (CL) used for reads. Basically the number of
> nodes that will have to acknowledge the read for Cassandra to consider it
> successful.
> CL.WRITE = CL used for writes.
> RF = Replication Factor



CL.READ + CL.WRITE > RF


If what you have is an availability issue, you should just make sure the CL
is lower than or equal to the RF and that all nodes are up and responsive.
FWIW, QUORUM = RF/2 + 1, thus if RF is 2 and the consistency level for
deletes is QUORUM (ie "2 / 2 + 1 = 2"), one node down could start breaking
availability (and most probably will) as CL > number of replicas available
for certain partitions.
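
To make that concrete with the formula above:

- RF = 2 -> QUORUM = 2/2 + 1 = 2, so a single replica down already breaks QUORUM for the partitions it owns.
- RF = 3 -> QUORUM = 3/2 + 1 = 2, so one replica can be down and QUORUM reads/writes still succeed.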

Reading the rest of the post might be useful while working out the design
of the schema and queries, in particular if you plan to use deletes/TTLs.

I hope that helps,
C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 18 juin 2019 à 09:56, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> a écrit :

> On Tue, Jun 18, 2019 at 8:06 AM ANEESH KUMAR K.M 
> wrote:
>
>>
>> I am using Cassandra cluster with 3 nodes which is hosted on AWS. Also we
>> have NodeJS web Application which is on AWS ELB. Now the issue is that,
>> when I add 2 or more servers (nodeJS) in AWS ELB then the delete queries
>> are not working on Cassandra.
>>
>
> Please provide a more concrete description than "not working".  Do you get
> an error?  Which one?  Does it "not working" silently, i.e. w/o an error,
> but you don't observe the expected effect?  How does the delete query look
> like, what is the effect you expect and what do you observe instead?
>
> --
> Alex
>
>


Re: node re-start delays , busy Deleting mc-txn-compaction/ Adding log file replica

2019-06-20 Thread Alain RODRIGUEZ
Also about your traces, and according to Jeff in another thread:

the incomplete sstable will be deleted during startup (in 3.0 and newer
> there’s a transaction log of each compaction in progress - that gets
> cleaned during the startup process)
>

maybe that's what you are seeing? Again, I'm not really familiar with those
traces. I find traces and debug pretty useless (or even counter-productive)
in 99% of the cases, so I don't use them much.

Le jeu. 20 juin 2019 à 12:25, Alain RODRIGUEZ  a écrit :

> Hello Asad,
>
>
>> I’m on environment with  apache Cassandra 3.11.1 with  java 1.8.0_144.
>
> One Node went OOM and crashed.
>
>
> If I remember well, the first minor versions of C* 3.11 had memory leaks. It
> seems that one was fixed in your version though.
>
> 3.11.1
>
> [...]
>
>  * BTree.Builder memory leak (CASSANDRA-13754)
>
>
> Yet other improvements were made later on:
>
>
>> 3.11.3
>
> [...]
>
>  * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)
>
>  * Reduce nodetool GC thread count (CASSANDRA-14475)
>
>
> See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.
> Before digging more I would upgrade to 3.11.latest (latest = 4 or 5 I
> guess), because early versions of a major Cassandra versions are famous for
> being quite broken, even though this major is a 'bug fix only' branch.
> Also minor versions upgrades are not too risky to go through. I would
> maybe start there if you're not too sure how to dig this.
>
> If it happens again or you don't want to upgrade, it would be interesting
> to know:
> -  if the OOM happens inside the JVM or on native memory (then the OS
> would be the one sending the kill signal). These 2 issues have different
> (and sometime opposite) fixes.
> - What's the host size (especially memory) and how the heap (and maybe
> some off heap structures) are configured (at least what is not default).
> - If you saw errors in the logs and what the 'nodetool tpstats' was
> looking like when the node went down (it might have been dumped in the logs)
>
> I don't know much about those traces nor why Cassandra would take a long
> time. Though they are traces and harder to interpret for me. What does the
> INFO / WARN / ERR look like?
> Maybe opening a lot of SSTables and/or replaying a lot of commit logs,
> given the nature of the restart (post outage)?
> To speed up things, when nodes are not crashing, under normal
> circumstances, use 'nodetool drain' as part of stopping the node, before
> stopping/killing the service/process.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le mar. 18 juin 2019 à 23:43, ZAIDI, ASAD A  a écrit :
>
>>
>>
>> I’m on environment with  apache Cassandra 3.11.1 with  java 1.8.0_144.
>>
>>
>>
>> One Node went OOM and crashed. Re-starting this crashed node is taking
>> long time. Trace level debug log is showing messages like:
>>
>>
>>
>>
>>
>> Debug.log trace excerpt:
>>
>> 
>>
>>
>>
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
>> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
>>
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
>> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
>>
>> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
>> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
>>
>> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting
>> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
>>
>> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log
>> file replica
>> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>>
>>
>>
>>
>>
>> Above messages are repeated for unique [mc--* ] files. Such messages
>> are repeating constantly.
>>
>>
>>
>> I’m seeking help here to find out what may be going on here , any hint to
>> root cause and how I can quickly start the node. Thanks in advance.
>>
>>
>>
>> Regards/asad
>>
>>
>>
>>
>>
>>
>>
>


Re: node re-start delays , busy Deleting mc-txn-compaction/ Adding log file replica

2019-06-20 Thread Alain RODRIGUEZ
Hello Asad,


> I’m on environment with  apache Cassandra 3.11.1 with  java 1.8.0_144.

One Node went OOM and crashed.


If I remember well, the first minor versions of C* 3.11 had memory leaks. It
seems that one was fixed in your version though.

3.11.1

[...]

 * BTree.Builder memory leak (CASSANDRA-13754)


Yet other improvements were made later on:


> 3.11.3

[...]

 * Remove BTree.Builder Recycler to reduce memory usage (CASSANDRA-13929)

 * Reduce nodetool GC thread count (CASSANDRA-14475)


See: https://github.com/apache/cassandra/blob/cassandra-3.11/CHANGES.txt.
Before digging more I would upgrade to 3.11.latest (latest = 4 or 5 I
guess), because early versions of a major Cassandra release are famous for
being quite broken, even though this major is a 'bug fix only' branch.
Also, minor version upgrades are not too risky to go through. I would maybe
start there if you're not too sure how to dig into this.

If it happens again or you don't want to upgrade, it would be interesting
to know:
-  if the OOM happens inside the JVM or on native memory (then the OS would
be the one sending the kill signal). These 2 issues have different (and
sometime opposite) fixes.
- What's the host size (especially memory) and how the heap (and maybe some
off heap structures) are configured (at least what is not default).
- If you saw errors in the logs and what the 'nodetool tpstats' was looking
like when the node went down (it might have been dumped in the logs)

I don't know much about those traces, nor why Cassandra would take a long
time. They are traces, though, and harder for me to interpret. What do the
INFO / WARN / ERROR messages look like?
Maybe it is opening a lot of SSTables and/or replaying a lot of commit logs,
given the nature of the restart (post outage)?
To speed things up, when nodes are not crashing, under normal
circumstances, use 'nodetool drain' as part of stopping the node, before
stopping/killing the service/process.
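
For example (a sketch; the service name 'cassandra' depends on how the node
was installed):

- nodetool drain  # stops accepting writes and flushes memtables, so there is (almost) nothing to replay at startup
- sudo systemctl stop cassandra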

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 18 juin 2019 à 23:43, ZAIDI, ASAD A  a écrit :

>
>
> I’m on environment with  apache Cassandra 3.11.1 with  java 1.8.0_144.
>
>
>
> One Node went OOM and crashed. Re-starting this crashed node is taking
> long time. Trace level debug log is showing messages like:
>
>
>
>
>
> Debug.log trace excerpt:
>
> 
>
>
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-CompressionInfo.db
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-Filter.db
>
> TRACE [main] 2019-06-18 21:30:43,449 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc-9337720-big-TOC.txt
>
> TRACE [main] 2019-06-18 21:30:43,455 LogTransaction.java:217 - Deleting
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_642976c0-91c3-11e9-97bb-6b1dee397c3f.log
>
> TRACE [main] 2019-06-18 21:30:43,458 LogReplicaSet.java:67 - Added log
> file replica
> /cassandra/data/enterprise/device_connection_ws-f65649e0aea011e7baeb8166fa28890a/mc_txn_compaction_5a6c8c90-91cc-11e9-97bb-6b1dee397c3f.log
>
>
>
>
>
> Above messages are repeated for unique [mc--* ] files. Such messages
> are repeating constantly.
>
>
>
> I’m seeking help here to find out what may be going on here , any hint to
> root cause and how I can quickly start the node. Thanks in advance.
>
>
>
> Regards/asad
>
>
>
>
>
>
>


Re: How to query TTL on collections ?

2019-06-20 Thread Alain RODRIGUEZ
Hello Maxim.

I think you won't be able to do what you want this way. Collections are
supposed to be (ideally small) sets of data that you'll always read
entirely, at once. At least it seems to work this way. I'm not sure about
the latest versions, but I did not hear about a new design for collections.

You can set values individually in a collection as you did above (and
probably should do so, to avoid massive tombstone creation), but you have
to read the whole thing at once:

```
$ ccm node1 cqlsh -e "SELECT items[10] FROM tlp_labs.products WHERE
product_id=1;"
:1:SyntaxException: line 1:12 no viable alternative at input '['
(SELECT [items][...)

$ ccm node1 cqlsh -e "SELECT items FROM tlp_labs.products WHERE
product_id=1;"
 items

 {10: {csn: 100, name: 'item100'}, 20: {csn: 200, name: 'item200'}}
```

Furthermore, you cannot query the TTL for a single item in a collection,
and as distinct items (cells) can have distinct TTLs, you cannot query the
TTL for the whole map (collection) either. As you cannot get the TTL for the
whole thing, nor query a single item of the collection, I guess there is no
way to get the currently set TTL for all or part of a collection.

If you need it, you would need to redesign this table, maybe split it: make
the collection a separate table, for example, that you would then reference
from your current table.
Another hack I'm just thinking of could be to add a 'ttl' field that gets
updated as well: any time a client updates the TTL for an entry, it would
update that 'ttl' field too. But again, you would still not be able to query
this information for just one item or a few; it would mean querying the
whole map again.
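
A minimal sketch of the 'separate table' idea (table and column names are
made up, not from your schema):

CREATE TABLE product_items (
  product_id bigint,
  item_id bigint,
  item frozen<item>,
  PRIMARY KEY (product_id, item_id)
);

UPDATE product_items USING TTL 10 SET item = {csn: 100, name: 'item100'} WHERE product_id = 1 AND item_id = 10;
SELECT TTL(item) FROM product_items WHERE product_id = 1 AND item_id = 10;

Each item is now a regular row (and 'item' a single frozen cell), so the TTL
can be both set and queried per item.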

I had to test it because I could not remember about this, and I think my
observations are making sense. Sadly, there is no 'good' syntax for this
query, it's just not permitted at all I would say. Sorry I have no better
news for you :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mer. 19 juin 2019 à 09:21, Maxim Parkachov  a
écrit :

> Hi everyone,
>
> I'm struggling to understand how can I query TTL on the row in collection
> ( Cassandra 3.11.4 ).
> Here is my schema:
>
> CREATE TYPE item (
>   csn bigint,
>   name text
> );
>
> CREATE TABLE products (
>   product_id bigint PRIMARY KEY,
>   items map<bigint, frozen<item>>
> );
>
> And I'm creating records with TTL like this:
>
> UPDATE products USING TTL 10 SET items = items + {10: {csn: 100, name:
> 'item100'}} WHERE product_id = 1;
> UPDATE products USING TTL 20 SET items = items + {20: {csn: 200, name:
> 'item200'}} WHERE product_id = 1;
>
> As expected first records disappears after 10 seconds and the second after
> 20. But if I already have data in the table I could not figure out how to
> query TTL on the item value:
>
> SELECT TTL(items) FROM products WHERE product_id=1;
> InvalidRequest: Error from server: code=2200 [Invalid query]
> message="Cannot use selection function ttl on collections"
>
> SELECT TTL(items[10]) FROM products WHERE product_id=1;
> SyntaxException: line 1:16 mismatched input '[' expecting ')' (SELECT
> TTL(items[[]...)
>
> Any tips, hints, tricks are highly appreciated,
> Maxim.
>


Re: Can I cancel a decommissioning procedure??

2019-06-05 Thread Alain RODRIGUEZ
Sure, you're welcome, glad to hear it worked! =)

Thanks for letting us know/reporting this back here, it might matter for
other people as well.

C*heers!
Alain


Le mer. 5 juin 2019 à 07:45, William R  a écrit :

> Eventually after the reboot the decommission was cancelled. Thanks a lot
> for the info!
>
> Cheers
>
>
> Sent with ProtonMail <https://protonmail.com> Secure Email.
>
> ‐‐‐ Original Message ‐‐‐
> On Tuesday, June 4, 2019 10:59 PM, Alain RODRIGUEZ 
> wrote:
>
> > the issue is that the rest nodes in the cluster marked it as DL
> (DOWN/LEAVING) thats why I am kinda stressed.. Lets see once is up!
>
> The last information other nodes had is that this node is leaving, and
> down, that's expected in this situation. When the node comes back online,
> it should come back UN and 'quickly' other nodes should ACK it.
>
> During decommission, the node itself is responsible for streaming its data
> over. Streams were stopped as the node went down, Cassandra won't remove
> the node unless data was streamed properly (or if you force the node out).
> I don't think that there is a decommission 'resume', and even less that it
> is enabled by default.
> Thus when the node comes back, the only possible option I see is a
> 'regular' start for that node and other to acknowledge that the node is up
> and not leaving anymore.
>
> The only consequence I expect (other than the node missing the latest
> data) is that other nodes might have some extra data due to the
> decommission attempts. If needed (if streams ran for long or data has no
> TTL), you can consider using 'nodetool cleanup -j 2' on all nodes other
> than the one that went down, to remove the extra data (and free space).
>
>  I did restart, still waiting to come up (normally takes ~ 30 minutes)
>>
>
> 30 minutes to start the nodes sounds like a long time to me, but well,
> that's another topic.
>
> C*heers
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> Le mar. 4 juin 2019 à 18:31, William R  a écrit :
>
>> Hi Alain,
>>
>> Thank you for your comforting reply :)  I did restart, still waiting to
>> come up (normally takes ~ 30 minutes) , the issue is that the rest nodes in
>> the cluster marked it as DL (DOWN/LEAVING) thats why I am kinda stressed..
>> Lets see once is up!
>>
>>
>> Sent with ProtonMail <https://protonmail.com> Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Tuesday, June 4, 2019 7:25 PM, Alain RODRIGUEZ 
>> wrote:
>>
>> Hello William,
>>
>> At the moment we keep the node down before figure out a way to cancel
>>> that.
>>>
>>
>> Off the top of my head, a restart of the node is the way to go to cancel
>> a decommission.
>> I think you did the right thing and your safety measure is also the fix
>> here :).
>>
>> Did you try to bring it up again?
>>
>> If it's really critical, you can probably test that quickly with ccm (
>> https://github.com/riptano/ccm), tlp-cluster (
>> https://github.com/thelastpickle/tlp-cluster) or simply with any
>> existing dev/test environment if you have any available with some data.
>>
>> Good luck with that, PEBKAC issues are the worst. You can do a lot of
>> damage, it could always have been avoided, and it makes you feel terrible.
>> It doesn't sound that bad in your case though, I've seen (and done)
>> worse ¯\_(ツ)_/¯. It's hard to fight PEBKACs; we, operators, are
>> unpredictable :).
>> Nonetheless, and to go back to something more serious, there are ways to
>> limit the amount and possible scope of those, such as good practices,
>> testing and automations.
>>
>> C*heers,
>> ---
>> Alain Rodriguez - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>>
>>
>> Le mar. 4 juin 2019 à 17:55, William R  a
>> écrit :
>>
>>> Hi,
>>>
>>> Was an accidental decommissioning of a node and we really need to to
>>> cancel it.. is there any way? At the moment we keep the node down before
>>> figure out a way to cancel that.
>>>
>>> Thanks
>>>
>>
>>
>


Re: Can I cancel a decommissioning procedure??

2019-06-04 Thread Alain RODRIGUEZ
Hello William,

At the moment we keep the node down before figure out a way to cancel that.
>

Off the top of my head, a restart of the node is the way to go to cancel a
decommission.
I think you did the right thing and your safety measure is also the fix
here :).

Did you try to bring it up again?

If it's really critical, you can probably test that quickly with ccm (
https://github.com/riptano/ccm), tlp-cluster (
https://github.com/thelastpickle/tlp-cluster) or simply with any existing
dev/test environment if you have any available with some data.

Good luck with that, PEBKAC issues are the worst. You can do a lot of
damage, it could always have been avoided, and it makes you feel terrible.
It doesn't sound that bad in your case though, I've seen (and done) worse
¯\_(ツ)_/¯. It's hard to fight PEBKACs; we, operators, are unpredictable :).
Nonetheless, and to go back to something more serious, there are ways to
limit the amount and possible scope of those, such as good practices,
testing and automations.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le mar. 4 juin 2019 à 17:55, William R  a
écrit :

> Hi,
>
> Was an accidental decommissioning of a node and we really need to to
> cancel it.. is there any way? At the moment we keep the node down before
> figure out a way to cancel that.
>
> Thanks
>


Re: Nodetool status load value

2019-05-29 Thread Alain RODRIGUEZ
Hello Simon,

Sorry if the question has already been answered.


This was probably answered here indeed (and multiple times I'm sure), but I
do not mind taking a moment to repeat this :).

About *why?*

This difference is expected. It can be due to multiple factors such as:
- Different compaction states on distinct nodes
- Ongoing compaction and temporary SSTables
- Different number of tombstones evicted (somewhat related to the first
point)
- imbalances in schema/workload (not applicable here, all nodes have 100%
of the data)
- A low number of vnodes (that is good for many other reasons) does have a
negative impact on distribution. (Not applicable here: with 256 vnodes per
node, data should be almost perfectly distributed)
- Any snapshots?
- ... (others that don't come to mind right now...)
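A few quick, read-only checks for the factors above (just a sketch, to run
on each node; 'tablestats' is 'cfstats' on older versions):

```
# List any snapshots lying around (mentioned above as a possible factor):
nodetool listsnapshots

# Ongoing compactions and their temporary SSTables:
nodetool compactionstats

# Per-table view of space used, to spot where the difference comes from:
nodetool tablestats | grep -E 'Table:|Space used'
```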

Anyway, to answer your question more precisely:

Is it OK to have differences between nodes ?


Yes, with these proportions it is perfectly OK. Nodes have a similar
dataset and I imagine queries are well distributed. The situation seems to
be normal; at least nothing looks wrong in this `nodetool status` output, I
would say.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le mer. 29 mai 2019 à 08:09, Simon ELBAZ  a écrit :

> Hi,
>
> Sorry if the question has already been answered.
>
> Where nodetool status is run on a 3 node cluster (replication factor : 3),
> the load between the different nodes is not equal.
>
> *# nodetool status opush*
> *Datacenter: datacenter1*
> *===*
> *Status=Up/Down*
> *|/ State=Normal/Leaving/Joining/Moving*
> *--  Address   Load   Tokens  Owns (effective)  Host
> ID   Rack*
> *UN  192.168.11.3  9,14 GB256 100,0%
> 989589e8-9fcf-4c2f-85e9-c0599ac872e5  rack1*
> *UN  192.168.11.2  8,54 GB256 100,0%
> 42223dd0-1adf-433c-810d-8bc87f0d3af4  rack1*
> *UN  192.168.11.4  8,92 GB256 100,0%
> 1cecacc3-c301-4ae9-a71e-1a1a944d5731  rack1*
>
> Is it OK to have differences between nodes ?
>
> Thanks for your answer.
>
> Simon
>
>


Re: Sstableloader

2019-05-29 Thread Alain RODRIGUEZ
Hello,

I can't answer this question about sstableloader (even though I think it
should be OK). My understanding, even though I'm not really up to date with
the latest Datastax work, is that DSE uses a modified but compatible
version of Cassandra for everything that is not a 'DSE feature'
specifically. In particular, I expect the SSTable format to be the same.
sstableloader has always been slow and inefficient for me, though I did not
use it much.

I think the way out of DSE should be documented somewhere in the Datastax
docs; if not, I think you can ask Datastax directly (or maybe someone here
can help you).

My guess is that the safest way out, without any downtime is probably to
perform a datacenter 'switch':
- Identify the Apache Cassandra version used under the hood by DSE (5.0.7).
Let's say it's 3.11.1 (I don't know)
- Add a new Apache Cassandra datacenter to your DSE cluster using this
version (I would rather use 3.11.latest in this case though... 3.11.1 had
memory leaks and other wild issues).
- Move client to this new DC
- Shutdown the old DC.
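As a very rough sketch of the middle steps (keyspace, datacenter names and
replication factors here are placeholders; the runbook linked below covers
the full ordering and details):

```
# Once the new DC's nodes have joined, make the keyspaces replicate to it:
cqlsh -e "ALTER KEYSPACE my_ks WITH replication = {
  'class': 'NetworkTopologyStrategy', 'dse_dc': 3, 'new_dc': 3};"

# Then, on each node of the new DC, stream the existing data from the old DC:
nodetool rebuild -- dse_dc
```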

I wrote a runbook to perform such an operation not that long ago, you can
find it here:
https://thelastpickle.com/blog/2019/02/26/data-center-switch.html

I don't know for sure that this is the best way to go out of DSE, but that
would be my guess and the first thing I would investigate (before
SSTableLoader, clearly).

Hope that helps, even though it does not directly answer the question
(which I'm unable to answer) about SSTable & sstableloader compatibility
with DSE clusters.

C*heers

Le mar. 28 mai 2019 à 22:22, Rahul Reddy  a
écrit :

> Hello,
>
> Does sstableloader works between datastax and Apache cassandra. I'm trying
> to migrate dse 5.0.7 to Apache 3.11.1 ?
>


Re: Collecting Latency Metrics

2019-05-29 Thread Alain RODRIGUEZ
Hello,

This metric is available indeed:

Most of the metrics available are documented here:
http://cassandra.apache.org/doc/latest/operating/metrics.html

For client requests (coordinator perspective latency):
http://cassandra.apache.org/doc/latest/operating/metrics.html#client-request-metrics
For local requests (per table/host latency, locally, no network
communication included):
http://cassandra.apache.org/doc/latest/operating/metrics.html#table-metrics

Latency: Special type that tracks latency (in microseconds) with a Timer plus
> a Counter that tracks the total latency accrued since starting. The
> former is useful if you track the change in total latency since the last
> check. Each metric name of this type will have ‘Latency’ and ‘TotalLatency’
> appended to it.


You need 'Latency', not 'TotalLatency'. I would guess that's the issue,
because latencies have been available for as far back as I remember
(including C* 2.0, and 1.2 for sure :)).

Also, be aware that quite a few things changed in the metric structure
between C* 2.1 and C*2.2 (and C*3.0 is similar to C*2.2).

Examples of changes:
- ColumnFamily --> Table
- 99percentile --> p99
- 1MinuteRate -->  m1_rate
- metric name before KS and Table names and some other changes of this kind.
- ^ aggregations / aliases and indexes changed because of this ^ - breaking
most of the charts (in my case at least).
- ‘.value’ is not appended to the metric name anymore for gauges, nothing
instead.

For example (Grafana / Graphite):
From
```aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
2, 3), 1, 7, 8, 9)```
to
```aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
2, 3), 1, 8, 9, 10)```


Another tip, is to use ccm locally (https://github.com/riptano/ccm) for
example and 'jconsole $cassandra_pid'. I use this -->jconsole $(ccm node1
show | grep pid | awk -F= '{print $2}')
Once you're in, you can explore available mbeans and find the metrics
available in 'org.apache.cassandra.[...]'. It's not ideal as you search
'manually' but it allowed me to find some metrics in the past or fix issues
from the doc above.
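For scripted checks, something like the sketch below can read the same
mbeans without a GUI (the jmxterm jar path, keyspace and table names are
placeholders, and the mbean layout shown is the post-2.2 one):

```
# Read a table's local read latency over JMX with jmxterm (non-interactive).
java -jar jmxterm.jar -l localhost:7199 -n <<'EOF'
get -b org.apache.cassandra.metrics:type=Table,keyspace=my_ks,scope=my_table,name=ReadLatency Mean 99thPercentile
EOF

# Or simply let nodetool do the math for you:
nodetool tablestats my_ks.my_table | grep -i latency
```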

Out of curiosity, may I ask what backend you used for your monitoring?

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le mer. 29 mai 2019 à 15:32, shalom sagges  a
écrit :

> Hi All,
>
> I'm creating a dashboard that should collect read/write latency metrics on
> C* 3.x.
> In older versions (e.g. 2.0) I used to divide the total read latency in
> microseconds with the read count.
>
> Is there a metric attribute that shows read/write latency without the need
> to do the math, such as in nodetool tablestats "Local read latency" output?
> I saw there's a Mean attribute in org.apache.cassandra.metrics.ReadLatency
> but I'm not sure this is the right one.
>
> I'd really appreciate your help on this one.
> Thanks!
>
>
>


Re: Can sstable corruption cause schema mismatch?

2019-05-29 Thread Alain RODRIGUEZ
Ideas that come mind are:

- Rolling restart of the cluster
- Use of 'nodetool resetlocalschema' --> the function name speaks for
itself. Note that this is to be run on each node you think is having schema
issues
- Are all nodes showing the same schema version?
- Port not fully open across all nodes?
- Anything in the logs?
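A small sketch of the first checks, assuming default log locations:

```
# One entry under "Schema versions" means agreement; several mean trouble:
nodetool describecluster

# Recent warnings/errors around the time the mismatch appeared:
grep -E 'WARN|ERROR' /var/log/cassandra/system.log | tail -n 50

# Only on a node stuck on an old schema version, make it re-fetch it:
nodetool resetlocalschema
```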

Do you know what triggered this situation in the first place?

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 28 mai 2019 à 18:28, Nitan Kainth  a écrit :

> Thank you Alain.
>
> Nodetool describecluster shows some nodes unreachable, different output
> from each node.
> Node1 can see all 4 nodes up.
> Node 2 says node 4 and node 5 unreachable
> Node 3 complains about node node 2 and node 1
>
> Nodetool status shows all nodes up and read writes are working for most
> most operations.
>
> Network looks good. Any other ideas?
>
>
> Regards,
>
> Nitan
>
> Cell: 510 449 9629
>
> On May 28, 2019, at 11:21 AM, Alain RODRIGUEZ  wrote:
>
> Hello Nitan,
>
> 1. Can sstable corruption in application tables cause schema mismatch?
>>
>
> I would say it should not. I could imagine it in the case where the
> corruption hits some 'system' keyspace sstable. If not, I don't see how
> corrupted data can impact the schema on the node.
>
>
>> 2. Do we need to disable repair while adding storage while Cassandra is
>> down?
>>
>
> I think you don't have to, but that it's a good idea.
> Repairs would fail as soon/long as you have a node down that should be
> involved (I think there is an option to change that behaviour now).
> Anyway, stopping repair and restarting it when all nodes are up probably
> allows you a better understanding/control of what's going on. Also, it
> reduces the load in times of trouble or maintenance, when the cluster is
> somewhat weaker.
>
> C*heers,
> ---
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
> Le mar. 28 mai 2019 à 17:13, Nitan Kainth  a
> écrit :
>
>> Hi,
>>
>> Two questions:
>> 1. Can sstable corruption in application tables cause schema mismatch?
>> 2. Do we need to disable repair while adding storage while Cassandra is
>> down?
>>
>>
>> Regards,
>>
>> Nitan
>>
>> Cell: 510 449 9629
>>
>


Re: Can sstable corruption cause schema mismatch?

2019-05-28 Thread Alain RODRIGUEZ
Hello Nitan,

1. Can sstable corruption in application tables cause schema mismatch?
>

I would say it should not. I could imagine it in the case where the
corruption hits some 'system' keyspace sstable. If not, I don't see how
corrupted data can impact the schema on the node.


> 2. Do we need to disable repair while adding storage while Cassandra is
> down?
>

I think you don't have to, but that it's a good idea.
Repairs would fail as soon/long as you have a node down that should be
involved (I think there is an option to change that behaviour now).
Anyway, stopping repair and restarting it when all nodes are up probably
allows you a better understanding/control of what's going on. Also, it
reduces the load in times of trouble or maintenance, when the cluster is
somewhat weaker.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le mar. 28 mai 2019 à 17:13, Nitan Kainth  a écrit :

> Hi,
>
> Two questions:
> 1. Can sstable corruption in application tables cause schema mismatch?
> 2. Do we need to disable repair while adding storage while Cassandra is
> down?
>
>
> Regards,
>
> Nitan
>
> Cell: 510 449 9629
>


Re: Ideal duration for max_hint_window_in_ms

2019-05-27 Thread Alain RODRIGUEZ
Hello Nanda,

what is the impact of increasing the duration of the max_hint_window_in_ms
>

You might want to be aware of the relations between 'max_hint_window_in_ms',
'gc_grace_seconds' and TTLs, to stay away from side effects and have only
the desired impact. My colleague Radovan wrote about this here:
http://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html

Other than that, I think the 3 hours were picked when hints were saved
into a system table. The previous design often led to hints being stuck,
especially when they were growing too big.
Yet on C* 3+, hints are now stored as files. I did not hear about many
issues nowadays. I did not hear about people trying to increase this value
either.

Thus I guess that if you handle the hinted handoff smoothly, so as not to
harm the cluster when the node comes back up, you consider the side effects
of changing this value as mentioned by Radovan, and you are using Cassandra
3+, you could give it a try. Also keep in mind that hints are an
optimization (they can be disabled). There is no guaranteed delivery for
hints (or at least that was the case before C* 3). This (alone) will not
'allow you' to disable repairs safely.

Now I don't have experience with storing hints longer since they are stored
in files. If you do, you should probably try it in some test cluster first.
But I'd be happy to hear about your experience with it.

---

Also, I think it might be more important to investigate why nodes are going
down and fix this instead/first. More hints might mean more pressure on the
nodes and you might have counter-productive impacts by increasing the hints
storage time.

A couple of random commands to investigate why nodes are going down, maybe
these commands I often use might be of some help to you:

- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # Anything in the
output there is probably worth your attention. If nodes go down something
should appear here.
- watch -d nodetool tpstats # Here you might use this on worst node at the
worst time to see if any threads are stacking in the 'pending' state. Also
check for 'blocked' and 'dropped'
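If you do decide to test a larger window, a rough sketch of the knobs
involved (the values are only examples):

```
# cassandra.yaml on each node, picked up at restart:
#   max_hint_window_in_ms: 21600000    # 6 hours instead of the default 3
grep -n 'max_hint_window_in_ms' /etc/cassandra/cassandra.yaml

# Keep hint replay gentle so a node coming back up is not hammered:
nodetool sethintedhandoffthrottlekb 1024

# Watch hint-related thread pools while a node is down / coming back:
nodetool tpstats | grep -i hint
```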

If you'd like some help with your 'main issue' first, we would need more
details and context.

Hope that any of this is of some help :).

C*heers,
-------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le ven. 24 mai 2019 à 19:59, Nandakishore Tokala <
nandakishore.tok...@gmail.com> a écrit :

> HI All,
>
> what is the impact of increasing the duration of
> the max_hint_window_in_ms, as we are seeing nodes are going down and some
> times we are not bringing them up in 3 hour's and during the repair, we are
> seeing a lot of streaming data, due to the node is down.
>
> so we are planning to increase the max_hint_window_in_ms time so that we
> will less streaming during repair, so is there any drawback in increasing
> the max_hint_window_in_ms?, and what is the ideal time for it(6 hrs, 12
> hrs, 24 hrs)
>
> Thanks
> Nanda
>


Re: schema for testing that has a lot of edge cases

2019-05-27 Thread Alain RODRIGUEZ
Hello Carl,

What you are trying to do sounds like a good match for one of the tools we
open-sourced and actively maintain:
https://github.com/thelastpickle/tlp-stress.

TLP Stress allows you to use defined profiles (see
https://github.com/thelastpickle/tlp-stress/tree/master/src/main/kotlin/com/thelastpickle/tlpstress/profiles)
or create your own profiles and/or schemas. Contributions are welcome. You
can tune workloads, the read/write ratio, the number of distinct
partitions, number of operations to run...
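A minimal illustration of the idea (the exact commands and flags move
between versions, so treat this as an assumption and check the project's
--help output):

```
# List the built-in workload profiles, then run one against a local
# test cluster (defaults to 127.0.0.1):
tlp-stress list
tlp-stress run KeyValue
```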

You might need multiple clients to maximize the throughput, depending on
the instances in use and your own testing goals.

version specific stuff to 2.1, 2.2, 3.x, 4.x


In case that might be of some use as well, we like to use it combined with
another of our tools: TLP Cluster (
https://github.com/thelastpickle/tlp-cluster). We can then easily create and
destroy Cassandra environments (on AWS), including Cassandra servers,
clients and monitoring (Prometheus).

You can have a look anyway, I think both projects might be of interest to
reach your goal.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le jeu. 23 mai 2019 à 21:25, Carl Mueller
 a écrit :

> Does anyone have any schema / schema generation that can be used for
> general testing that has lots of complicated aspects and data?
>
> For example, it has a bunch of different rk/ck variations, column data
> types, altered /added columns and data (which can impact sstables and
> compaction),
>
> Mischeivous data to prepopulate (such as
> https://github.com/minimaxir/big-list-of-naughty-strings for strings,
> ugly keys in maps, semi-evil column names) of sufficient size to get on
> most nodes of a 3-5 node cluster
>
> superwide rows
> large key values
>
> version specific stuff to 2.1, 2.2, 3.x, 4.x
>
> I'd be happy to centralize this in a github if this doesn't exist anywhere
> yet
>
>
>


Re: Optimal Heap Size Cassandra Configuration

2019-05-21 Thread Alain RODRIGUEZ
Hello,

I completely agree with Elliott above on the observation that it is hard
to say what *this cluster* needs. Yet, my colleague Jon wrote a small guide
on how to tune this in most cases, or as a starting point let's say.
We often write posts when we see repetitive questions on the mailing list;
this gives us hints on what are good topics to cover in a post (so we don't
repeat the same thing here every day :)). This was most definitely one of
the most 'demanded' topics and I believe this might be really helpful to
you:
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html.

No one starts working on garbage collection unless they have to, because
for many it's a different world they don't know about. It was my case: I
did not touch GC for the first 2 years after I started. But really, you'll
see that we can reason about GC and make things way better in some cases.
The first time I changed GC, I divided the cluster size by 2 and still
divided latency by 2. So the improvements were substantial; it was worth
the interest we put into it.

In addition, to break the ice with GC, I found using http://gceasy.io to
be an excellent way to monitor/troubleshoot GC. Feed it with some GC logs
and it will give you the GC throughput (% of time the JVM is available -
not doing a 'stop the world' pause). To give you some numbers, this should
be > 95-98% minimum. If you are seeing a lower throughput, chances are high
that you can 'easily' improve performance there.

There are a lot more details in this analysis that might help you get your
head around GC and tune it properly.
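A sketch of how to collect the material for such an analysis (paths assume
a default package install):

```
# Quick look at pause statistics as seen by Cassandra itself:
nodetool gcstats

# Long pauses are also logged by the GCInspector:
grep 'GCInspector' /var/log/cassandra/system.log | tail -n 20

# The gc.log files (enabled via jvm.options / cassandra-env.sh) are what
# you would feed to gceasy.io for the throughput figure discussed above:
ls /var/log/cassandra/gc.log*
```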

I generally prefer using CMS, but I saw some very successful clusters
using G1GC. G1GC is known to work better with bigger heaps. If you're going
to use 8 GB (or even 16 GB) for the heap, I would most probably stick to
CMS and tune it properly, but again, G1GC might work quite well, with way
less effort, if you can assign 16+ GB to the heap let's say.

Work on a canary node (only one random node) while changing this, then
observe the logs with GCeasy. Repeat until you're happy with it (I would be
happy with about 95% to 98% GC throughput, i.e. 2 to 5% of time paused).
But what really matters is that after the changes you have better latency,
fewer dropped messages, etc. You can measure the impact in GC throughput.
When the workload seems to be optimised enough / you're tired of playing
with GC, you can apply the changes everywhere and observe the impact on the
cluster (latency/dropped messages/CPU load...).

Hope that helps and completes somewhat Elliott's excellent answer.
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 20 mai 2019 à 23:31, Elliott Sims  a écrit :

> It's not really something that can be easily calculated based on write
> rate, but more something you have to find empirically and adjust
> periodically.
> Generally speaking, I'd start by running "nodetool gcstats" or similar and
> just see what the GC pause stats look like.  If it's not pausing much or
> for long, you're good.  If it is, you'll likely need to do some tuning
> based on GC logging which may involve increasing the heap but could also
> mean decreasing it or changing the collection strategy.
>
> Generally speaking, with G1GC you can get away with just setting a larger
> heap than you really need and it's close enough to optimal.  CMS is
> theoretically more efficient, but far more complex to get tuned properly
> and tends to fail more dramatically.
>
> On Mon, May 20, 2019 at 7:38 AM Akshay Bhardwaj <
> akshay.bhardwaj1...@gmail.com> wrote:
>
>> Hi Experts,
>>
>> I have a 5 node cluster with 8 core CPU and 32 GiB RAM
>>
>> If I have a write TPS of 5K/s and read TPS of 8K/s, I want to know what
>> is the optimal heap size configuration for each cassandra node.
>>
>> Currently, the heap size is set at 8GB. How can I know if cassandra
>> requires more or less heap memory?
>>
>> Akshay Bhardwaj
>> +91-97111-33849
>>
>


Re: Corrupted sstables

2019-05-10 Thread Alain RODRIGUEZ
Hello Roy,

The name of the table makes me think that you might be doing automated
changes to the schema. I just dug into this topic for someone else, and
schema changes are way less consistent than standard Cassandra operations
(see
https://issues.apache.org/jira/browse/CASSANDRA-10699).

> sessions_rawdata/sessions_v2_2019_05_06-9cae0c20585411e99aa867a11519e31c/md-816-big-I
>
>
Idea 1: Some of these queries might have failed for multiple reasons on a
node (down for too long, race conditions, ...), leaving the cluster in an
unstable state where there is a schema disagreement. In that case, you
could have trouble when adding a new node; I have seen it happen. Could
you check/share with us the output of 'nodetool describecluster'?

Also, did you recently try a rolling restart? This often helps
synchronise local schemas and 'could' fix the issue. Another option is
'nodetool resetlocalschema' on the node(s) out of sync.

Idea 2: If you identified that you have broken secondary indexes, maybe
try running 'nodetool rebuild_index <keyspace> <table> <index>' on
all nodes before adding the next node?
https://cassandra.apache.org/doc/latest/tools/nodetool/rebuild_index.html

Hope this helps,
C*heers,
-------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le jeu. 9 mai 2019 à 17:29, Jason Wee  a écrit :

> maybe print out value into the logfile and that should lead to some
> clue where it might be the problem?
>
> On Tue, May 7, 2019 at 4:58 PM Paul Chandler  wrote:
> >
> > Roy, We spent along time trying to fix it, but didn’t find a solution,
> it was a test cluster, so we ended up rebuilding the cluster, rather than
> spending anymore time trying to fix the corruption. We have worked out what
> had caused it, so were happy it wasn’t going to occur in production. Sorry
> that is not much help, but I am not even sure it is the same issue you have.
> >
> > Paul
> >
> >
> >
> > On 7 May 2019, at 07:14, Roy Burstein  wrote:
> >
> > I can say that it happens now as well ,currently no node has been
> added/removed .
> > Corrupted sstables are usually the index files and in some machines the
> sstable even does not exist on the filesystem.
> > On one machine I was able to dump the sstable to dump file without any
> issue  . Any idea how to tackle this issue ?
> >
> >
> > On Tue, May 7, 2019 at 12:32 AM Paul Chandler  wrote:
> >>
> >> Roy,
> >>
> >> I have seen this exception before when a column had been dropped then
> re added with the same name but a different type. In particular we dropped
> a column and re created it as static, then had this exception from the old
> sstables created prior to the ddl change.
> >>
> >> Not sure if this applies in your case.
> >>
> >> Thanks
> >>
> >> Paul
> >>
> >> On 6 May 2019, at 21:52, Nitan Kainth  wrote:
> >>
> >> can Disk have bad sectors? fccheck or something similar can help.
> >>
> >> Long shot: repair or any other operation conflicting. Would leave that
> to others.
> >>
> >> On Mon, May 6, 2019 at 3:50 PM Roy Burstein 
> wrote:
> >>>
> >>> It happens on the same column families and they have the same ddl (as
> already posted) . I did not check it after cleanup
> >>> .
> >>>
> >>> On Mon, May 6, 2019, 23:43 Nitan Kainth  wrote:
> >>>>
> >>>> This is strange, never saw this. does it happen to same column family?
> >>>>
> >>>> Does it happen after cleanup?
> >>>>
> >>>> On Mon, May 6, 2019 at 3:41 PM Roy Burstein 
> wrote:
> >>>>>
> >>>>> Yes.
> >>>>>
> >>>>> On Mon, May 6, 2019, 23:23 Nitan Kainth 
> wrote:
> >>>>>>
> >>>>>> Roy,
> >>>>>>
> >>>>>> You mean all nodes show corruption when you add a node to cluster??
> >>>>>>
> >>>>>>
> >>>>>> Regards,
> >>>>>> Nitan
> >>>>>> Cell: 510 449 9629
> >>>>>>
> >>>>>> On May 6, 2019, at 2:48 PM, Roy Burstein 
> wrote:
> >>>>>>
> >>>>>> It happened  on all the servers in the cluster every time I have
> added node
> >>>>>> .
> >>>>>> This is new cluster nothing was upgraded here , we have a similar
> cluster
> >>>>>> running on C* 2.1.15 with no issues .
> >>>>>> We are aware to the scrub utility just it reproduce every time we
> added
> >>>>>> node to the cluster .
> >>>>>>
> >>>>>> We have many tables there
> >>
> >>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Schema Management Best Practices

2019-05-10 Thread Alain RODRIGUEZ
Hello Mark

Second,  any ideas what could be creating bottlenecks for schema alteration?


I am not too sure what could be going on to make things that long, but
about the corrupted data, I've seen it before. Here are some thoughts
around schema changes and finding the bottlenecks:

Ideally, use commands to make sure there is no disagreement before moving
on to the next schema change. Worst case, *slow down* the pace of the
changes. Schema change queries are not designed to be performant, nor to be
used asynchronously or so many times in a row. Thus they should not, and
cannot, be used as standard queries hammering the cluster, as of now and
imho. For design reasons, it's not good to run asynchronous / fast-paced
'alter table' queries. More information about current design issues and
incoming improvements is available on Jira:
https://issues.apache.org/jira/browse/CASSANDRA-9449,
https://issues.apache.org/jira/browse/CASSANDRA-9425,
https://issues.apache.org/jira/browse/CASSANDRA-10699, ...

You might still improve speed. If you want to find the bottleneck, use
monitoring dashboards or nodetool:
- 'nodetool describecluster' - shows each node's schema version.
- 'watch -d nodetool tpstats' --> look for pending / dropped operations
- Check your GC performance if you see a lot of / big GC pauses (maybe
https://gceasy.io might help you there).
- Check the logs (for warn/error - missing columns or mismatching schema)
and the system.schema_column_families table to dig deeper and see what each
node has as its source of truth.
I hope you'll find some clues you can investigate further with one of those
^.

Also, 'nodetool resetlocalschema' could help if some nodes are stuck with
an old schema version:
http://cassandra.apache.org/doc/latest/tools/nodetool/resetlocalschema.html.
Often a rolling restart also does the trick. If you need more specific
help, it would be good to share the version you are using, so we know where
to look, and also to give us an idea of the number of nodes.

Some of this stuff is shown here:
https://docs.datastax.com/en/cql/3.3/cql/cql_using/useCreateTableCollisionFix.html.
The existence of this kind of document shows that this is a common issue and
it even says clearly: 'Dynamic schema creation or updates can cause schema
collision resulting in errors.'

So for now, I would move slowly, probably automating a check of the output
of 'nodetool describecluster' to make sure the schema has spread before
going for the next mutation.
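For instance, a rough sketch of such a check between statements (a plain
bash loop, nothing official; the file name and the grep are assumptions to
adapt to your own output):

```
#!/usr/bin/env bash
# Apply ALTER statements one by one, waiting for schema agreement between
# each. Assumes one statement per line in alter_statements.cql.
while IFS= read -r stmt; do
  cqlsh -e "$stmt"
  # Crude check: describecluster prints one bracketed node list per schema
  # version, so a single '[' line means every node agrees.
  until [ "$(nodetool describecluster | grep -c '\[')" -eq 1 ]; do
    echo "Waiting for schema agreement..."
    sleep 5
  done
done < alter_statements.cql
```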

the number of "alter table" statements was quite large (300+).
>

I must admit I don't know the 'alter table' path well internally, but I
think this is a lot of changes and it's not designed to happen quickly.
Slow it down and add a control procedure in the path, I would say.

First, is cqlsh the best way to handle these types of loads?


Yes, I see no problem with that. It could also be done through any other
Cassandra client. Maybe they would be faster; I never had to do so many
changes at once :). You can give Python or Java a try for this work, I
guess. In that case use synchronous requests and automate checks, I would
say, to definitely stay away from race conditions / data corruption.


>  Our DBAs report that even under normal conditions they send alter table
> statements in small chunks or else the will see load times of 20-45 minutes.


I also often noticed schema changes take some time, but I did not mind
much. Maybe the comments above, which should hopefully keep you away from
race conditions, or the use of some other client (Java instead of cqlsh,
let's say) might help. I guess you could give it a try.

I would definitely start by reading more (Jira/doc/code) about how
Cassandra performs those changes if I had to do this kind of batch of
changes, because schema changes do not seem to be as safe and efficient as
most of Cassandra internals are nowadays (for mainstream features, not
counting MVs, indexes, triggers, etc). This common need - making multiple
changes to your data model quickly - should be handled with care and
understanding in Cassandra, for now I would say.

I hope some of the above might be useful to you,

C*heers,
-------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 9 mai 2019 à 13:40, Mark Bidewell  a écrit :

> I am doing post-mortem on an issue with our cassandra cluster.  One of our
> tables became corrupt and had to be restored via a backup.  The table
> schema has been undergoing active development, so the number of "alter
> table" statements was quite large (300+).  Currently, we use cqlsh to do
> schema loads.  During the restore, the schema load alone took about 4 hours.
>
> Our DBAs report that even under normal conditions they send alter table
> statements in small chunks or else the will see load times of 20-45 minutes.
>
> My question is two part.  First, i

Re: Backup Restore

2019-04-26 Thread Alain RODRIGUEZ
Hello Ivan,

Is there a way I can do one command to backup and one to restore a backup?



Handling backups and restores automatically is not an easy task to work
on. It's not straightforward. But it's doable, and some tools (with both
open source and commercial licences) do this process (or part of it) for
you.

I wrote a post last year, aiming at presenting existing 'ways' to do
backups, I then evaluated and compared them. Even if it's getting old, you
might find it interesting:
http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html

Or the only way is to create snapshots of all the tables and then restore
> one by one?


You can also 'just' copy the data over (a process described in the post
above), but using snapshots reduces the chances of inconsistencies,
especially when run at the same time on all nodes.
Also, for restores, nothing obliges you to act node by node. A restore
often means the service is off. Restoring all nodes at once is possible
and a good thing to do imho.
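For the snapshot-based flavour, the two ends of the process look roughly
like this on each node (tag, keyspace and paths are placeholders):

```
# Backup: take a snapshot on every node at (roughly) the same time, then
# ship the hard-linked files off the node.
nodetool snapshot -t backup_2019_04_26 my_keyspace
tar czf /backups/$(hostname)_backup_2019_04_26.tgz \
    /var/lib/cassandra/data/my_keyspace/*/snapshots/backup_2019_04_26

# Free the disk space once the copy is safely stored elsewhere:
nodetool clearsnapshot -t backup_2019_04_26
```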

Hope that helps!

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 26 avr. 2019 à 05:49, Naman Gupta  a
écrit :

> You should take a look into ssTableLoader cassandra utility,
>
> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsBulkloader.html
>
> On Fri, Apr 26, 2019 at 1:33 AM Ivan Junckes Filho 
> wrote:
>
>> Hi guys,
>>
>> I am trying do a bakup and restore script in a simple way. Is there a way
>> I can do one command to backup and one to restore a backup?
>>
>> Or the only way is to create snapshots of all the tables and then restore
>> one by one?
>>
>


Re: Can we set the precision of decimal column

2019-04-25 Thread Alain RODRIGUEZ
Hello,



My 2 cents? Do not use floats for money, even less so for billing. It has
never been good for any database, due to how that number is represented for
the computer. It's still the case with Cassandra, and it even seems that
internally we have a debate about how we are doing things there:
https://lists.apache.org/thread.html/5eb0ae8391c1e03bf1926cda11c62a9cc076774bfff73c3014d0d3d4@%3Cdev.cassandra.apache.org%3E

I would use an integer type and create an interface that multiplies the
number by 10 000 when storing it and divides it by the same number when
reading it. This way, no more imprecise representation of your number for
the machine. I said 10 000 arbitrarily because I've never seen a system
that would need more precision, but the factor can be smaller or bigger
depending on how big the numbers you work with are and the precision you
want to keep.
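A tiny sketch of that idea (names are purely illustrative; the 10 000
factor matches the example above):

```
# Store amounts as scaled integers (e.g. 12.3456 -> 123456) and convert at
# the application edges.
cqlsh <<'EOF'
CREATE KEYSPACE IF NOT EXISTS shop WITH replication =
    {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS shop.invoices (
    invoice_id uuid PRIMARY KEY,
    amount_e4  bigint  -- amount multiplied by 10 000 on write, divided on read
);

INSERT INTO shop.invoices (invoice_id, amount_e4) VALUES (uuid(), 123456);
EOF
```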

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le mar. 23 avr. 2019 à 06:04, xiaobo  a écrit :

> Hi,
>
> We usually use the decimal data type for money; can we set decimal columns
> with a specific precision?
>
> Thanks.
>


Re: Assassinate fails

2019-04-05 Thread Alain RODRIGUEZ
Alex,

Well, I tried : rolling restart did not work its magic.
>

Sorry to hear that, and sorry for misleading you. My faith in the rolling
restart's magical power went down a bit, but I still think it was worth a
try :D.

> @ Alain : In system.peers I see both the dead node and its replacement
> with the same ID :
>peer | host_id
>   --+--
>192.168.1.18 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>192.168.1.22 | 09d24557-4e98-44c3-8c9d-53c4c31066e1
>
> Is it expected ?
>
> If I cannot fix this, I think I will add new nodes and remove, one by one,
> the nodes that show the dead node in nodetool status.
>
Well, no. This is clearly not good or expected I would say.

*tl;dr - Suggested fix:*
What I would try, to fix this, is removing this row. It *should* be safe,
but that's only my opinion, and on the condition that you remove *only* the
'ghost/dead' nodes. Any mistake here would probably be costly. Again, be
aware you're in a sensitive area when messing with system tables. Think
twice, check twice, take a copy of the SSTables / a snapshot. Then I would
go for it and observe the changes on one node first. If no harm is done,
continue to the next node.

Considering the old node is '192.168.1.18', I would run this on all nodes
(maybe after testing on a node) to make it simple or just run it on nodes
that show the ghost node(s):

*"DELETE FROM system.peers WHERE peer = '192.168.1.18';"*

You may need to restart, but I think you won't even need that. I have good
hope that this should finally fix your issue with no harm.
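Spelled out, with the safety copy mentioned above (only a sketch; the IP
is the one from your own system.peers output, and it is worth testing on a
single node first):

```
# 1. Safety net: snapshot the system keyspace before touching it.
nodetool snapshot -t before-peers-cleanup system

# 2. Confirm which entry is the ghost one.
cqlsh -e "SELECT peer, host_id FROM system.peers;"

# 3. Remove only the dead node's row.
cqlsh -e "DELETE FROM system.peers WHERE peer = '192.168.1.18';"
```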

*More context - Idea of the problem:*
The above is clearly an issue, I would say - most probably the source of
your troubles here. The problem is that I lack understanding: from where I
stand, this kind of bug should not happen anymore in Cassandra (I have not
seen anything similar for a while).

I would blame:
- A corner case scenario (unlikely, system tables have been rather solid
for a while). Or maybe you are using an old C* version. It *might* be
related to this (or similar):
https://issues.apache.org/jira/browse/CASSANDRA-7122
- A really weird operation (a succession of actions might have put you in
this state, but it's hard for me to say what)
- KairosDB? I don't know it or what it does. Might it be less reliable than
Cassandra is, and have led to this issue? Maybe, I have no clue once again.

*Risk of this operation and current situation:*
Also, I *think* the current situation is relatively 'stable' (maybe just
some hints being stored for nothing, and possibly not being able to add
more nodes or change the schema?). This is the kind of situation where
'rushing' a solution without understanding the impacts and risks can make
things go terribly wrong. Take the time to analyse my suggested fix, maybe
read the ticket above, etc. When you're ready, back up the data, prepare
the DELETE command carefully and observe how 1 node reacts to the fix
first.

As you can see, I think it's the 'good' fix, but I'm not comfortable with
this operation. And you should not be either :).
I would say, arbitrarily, to share my feeling about this operation, that
there is a 95% chance this does not hurt and a 90% chance it fixes the
issue, but if something goes wrong, if we are in the 5% where it does not
go well, there is a non-negligible probability that you will destroy your
cluster in a very bad way. I guess what I am trying to say is: be careful,
watch your step, make sure you remove the right line, and ensure it works
on one node with no harm.
I shared my feeling and I would try this fix. But it's ultimately your
responsibility, and I won't be behind the machine when you fix it. None of
us will.

Good luck ! :)

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le jeu. 4 avr. 2019 à 19:29, Kenneth Brotman 
a écrit :

> Alex,
>
> According to this TLP article
> http://thelastpickle.com/blog/2018/09/18/assassinate.html :
>
> Note that the LEFT status should stick around for 72 hours to ensure all
> nodes come to the consensus that the node has been removed. So please don’t
> rush things if that’s the case. Again, it’s only cosmetic.
>
> If a gossip state will not forget a node that was removed from the cluster
> more than a week ago:
>
> Login to each node within the Cassandra cluster.
> Download jmxterm on each node, if nodetool assassinate is not an
> option.
> Run nodetool assassinate, or the unsafeAssassinateEndpoint command,
> multiple times in quick succession.
> I typically recommend running the command 3-5 times within 2
> seconds.
> I understand that sometimes the command takes time to return, so
> the “2 seconds” suggestion is less of a requirement than it is a mindset.
> Also, sometimes 3-5 times i

Re: Assassinate fails

2019-04-04 Thread Alain RODRIGUEZ
Hi Alex,

About previous advices:

You might have inconsistent data in your system tables.  Try setting the
> consistency level to ALL, then do read query of system tables to force
> repair.
>

System tables use the 'LocalStrategy', thus I don't think any repair would
happen for the system.* tables. Regardless the consistency you use. It
should not harm, but I really think it won't help.

This will sound a little silly but, have you tried rolling the cluster?


On the contrary, the rolling restart does not sound that silly to me. I
would try it before touching any other 'deeper' systems. It has indeed
sometimes proven to do some magic for me as well. It's hard to guess on
this kind of ghost node issue without working on the machine (and sometimes
even when accessing the machine I had some trouble =)). Also, a rolling
restart is an operation that should be easy to perform and low risk (if
everything is well configured).

Other idea to explore:

You can actually select from the 'system.peers' table to see if all
(other) nodes are referenced on each node. There should not be any dead
nodes in there. By the way, you will see that different nodes have slightly
different data in system.peers, and are not in sync, thus there is no way
to 'repair' that really.
'SELECT' is safe. If you delete non-existent 'peers', if any, and the node
is dead anyway, this shouldn't hurt, but make sure you are doing the right
thing; you can easily break your cluster from there. I have not seen an
issue (a bug) of this kind for a while though. Normally you should not have
to go that deep, touching system tables.

Also, removed nodes should disappear from peers immediately but persist
for some time (7 days maybe?) in the gossip information (normally as
'LEFT'). This should not create the issue seen in 'nodetool
describecluster' though.

C*heers,
-------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le jeu. 4 avr. 2019 à 16:09, Nick Hatfield  a
écrit :

> This will sound a little silly but, have you tried rolling the cluster?
>
>
>
> $> nodetool flush; nodetool drain; service cassandra stop
> $> ps aux | grep ‘cassandra’
>
> # make sure the process actually dies. If not you may need to kill -9
> <pid>. Check first to see if nodetool can connect, e.g. nodetool
> gossipinfo. If the connection is live and listening on the port, then just
> try re-running service cassandra stop again. Kill -9 as a last resort
>
> $> service cassandra start
> $> nodetool netstats | grep ‘NORMAL’  # wait for this to return before
> moving on to the next node.
>
>
>
> Restart them all using this method, then run nodetool status again and see
> if it is listed.
>
>
>
> Once other thing, I recall you said something about having to terminate a
> node and then replace it. Make sure that whichever node you did the
> –Dreplace flag on, does not still have it set when you start cassandra on
> it again!
>
>
>
> *From:* Alex [mailto:m...@aca-o.com]
> *Sent:* Thursday, April 04, 2019 4:58 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Assassinate fails
>
>
>
> Hi Anthony,
>
> Thanks for your help.
>
> I tried to run multiple times in quick succession but it fails with :
>
> -- StackTrace --
> java.lang.RuntimeException: Endpoint still alive: /192.168.1.18
> generation changed while trying to assassinate it
> at
> org.apache.cassandra.gms.Gossiper.assassinateEndpoint(Gossiper.java:592)
>
> I can see that the generation number for this node increases by 1 every
> time I call nodetool assassinate ; and the command itself waits for 30
> seconds before assassinating node. When ran multiple times in quick
> succession, the command fails because the generation number has been
> changed by the previous instance.
>
>
>
> In 'nodetool gossipinfo', the node is marked as "LEFT" on every node.
>
> However, in 'nodetool describecluster', this node is marked as
> "unreacheable" on 3 nodes out of 5.
>
>
>
> Alex
>
>
>
> Le 04.04.2019 00:56, Anthony Grasso a écrit :
>
> Hi Alex,
>
>
>
> We wrote a blog post on this topic late last year:
> http://thelastpickle.com/blog/2018/09/18/assassinate.html.
>
>
>
> In short, you will need to run the assassinate command on each node
> simultaneously a number of times in quick succession. This will generate a
> number of messages requesting all nodes completely forget there used to be
> an entry within the gossip state for the given IP address.
>
>
>
> Regards,
>
> Anthony
>
>
>
> On Thu, 4 Apr 2019 at 03:32, Alex  wrote:
>
> Same result it seems:
> Welcome to JMX terminal. Type "help"

Re: Five Questions for Cassandra Users

2019-04-02 Thread Alain RODRIGUEZ
ra, you could really harm your
cluster.


> 5.   Do you use artificial intelligence to help manage your clusters?



So far I have only used "human intelligence" (mine, the collective one
from this mailing list and my colleagues' - really often) to manage my own
and other people's clusters ;-).

But there's a small part of what I do that I could trust a machine to do
for me, and better than I would do it. There are a lot of tools out there
that do "things" for us (Reaper, OpsCenter, in-house shared/OSS tools,
Netflix has opened a lot of tools over the years, but also dashboards that
you just have to plug in, etc.) that bring some intelligence from other
people who faced the same problems - not real AI per se, as it won't learn
by itself, maybe. Also, there is ongoing work to make operating Cassandra
greater; search for "management tool" off the top of my head...

I never used any AI to help manage my clusters. The closest thing I had
installed at some point was the OpsCenter advisor, multiple years ago. But
I knew more than 'it' (the AI) about Cassandra by then ;-). I never had the
chance to see a really great AI that would actually help me with cluster
management. Thus I see the interest of some tools to help people manage
their clusters.

Other alternatives are fully managed Cassandra cluster services, if that's
of interest to you, using the mailing list (as you did), or working with
consultants (but I could be a bit biased recommending you to work with
consultants ;-)).

Hope some of it helps (I always write too much ¯\_(ツ)_/¯),

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com




Le lun. 1 avr. 2019 à 20:45, Rahul Singh  a
écrit :

> Answers inline.
>
>
> 1.   Do the same people where you work operate the cluster and write
> the code to develop the application?
>
>
> No but the operators need to know development , data-modeling, and
> generally how to "code" the application. (Coding is a low-level task of
> assigning a code to a concept.. so I don't think that's the proper verb in
> these scenarios.. engineering, or software development, or even programing
> is a better term). It's because the developers are hired dime a dozen at
> the B / C level and then replaced by D /E / F level developers as things go
> on.. so the Data team eventually ends up being the expert of the
> application and the data platform, and a "Center of Excellence" for the
> development / architects to work with on a collaborative basis.
>
>
>
> 2.   Do you have a metrics stack that allows you to see graphs of
> various metrics with all the nodes displayed together?
>
>
>
> Yes. OpsCenter, ELK, Grafana, custom node data visualizers in excel
> (because lines and charts don't tell you everything)
>
>
> 3.   Do you have a log stack that allows you to see the logs for all
> the nodes together?
>
> ELK. CloudWatch
>
>
> 4.   Do you regularly repair your clusters - such as by using Reaper?
>
>  Depends. Cron, Reaper, OpsCenter Repair, and now NodeSync
>
>
> 5.   Do you use artificial intelligence to help manage your clusters?
>
>
> Yes, I actually have made an artificial general intelligence called
> Gravitron. It learns by ingesting all the news articles I aggregate about
> Cassandra and the links I curate on cassandra.link into a solr/lucene index
> and then using clustering find out the most popular and popularly connected
> content. Once it does that there's a summarization of the content into
> human readable content as well as interpreted bash code that gets pushed
> into a "Recipe Book." As the master operator identifies scenarios using
> english language, and then runs the bash commands, the machine slowly but
> surely "wakes up" and starts to manage itself. It can also play Go , the
> game, and beat IBM's AlphaGo at Go, and Donald Trump at golf while he was
> cheating!
>
>
>
> rahul.xavier.si...@gmail.com
>
> http://cassandra.link
>
> I'm speaking at #DataStaxAccelerate, the world’s premiere #ApacheCassandra
> conference, and I want to see you there! Use my code Singh50 for 50% off
> your registration. www.datastax.com/accelerate
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Happy april fools day.
>
>
>
>
>
> On Thu, Mar 28, 2019 at 5:03 AM Kenneth Brotman
>  wrote:
>
>> I’m looking to get a better feel for how people use Cassandra in
>> practice.  I thought others would benefit as well so may I ask you the
>> following five questions:
>>
>>
>>
>> 1.   Do the s

Re: Best practices while designing backup storage system for big Cassandra cluster

2019-04-01 Thread Alain RODRIGUEZ
Hello Manish,

I think any disk works, as long as it is big enough. It's also better if
it's a reliable system (some kind of redundant RAID, NAS, storage like GCS
or S3...). During a backup we are mostly not looking for speed, but for
resiliency and for not harming the source cluster.
Then, how fast you write to the backup storage system will more often be
limited by what you can read from the source cluster.
The backups have to be taken from running nodes, thus it's easy to overload
the disk (reads), the network (exporting backup data to the final
destination), and even the CPU (as/if the machine handles the transfer).

What are the best practices while designing backup storage system for big
> Cassandra cluster?


What is nice to have (not to say mandatory) is a system of incremental
backups. You should not take all the data from the nodes every time, or
you'll either harm the cluster regularly OR spend days transferring the
data (if the amount of data grows big enough).
I'm not speaking about Cassandra incremental snapshots, but of using
something like AWS snapshots, or copying this behaviour programmatically to
take (copy, link?) old SSTables from previous backups when they exist. This
will greatly unload the cluster's work and the resources needed, as soon
enough a substantial amount of the data should be coming from the backup
data store itself. The problem with incremental snapshots is that when
restoring, you have to restore multiple pieces, making it harder and
involving a lot of compaction work.
The "caching" technic mentioned above gives the best of the 2 worlds:
- You will always backup from the nodes only the sstables you don’t have
already in your backup storage system,
- You will always restore easily as each backup is a full backup.
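One way to get that 'only ship what is new' behaviour, sketched here with
an S3 bucket as the destination (bucket name, tag and paths are
placeholders; dedicated backup tools do this bookkeeping more robustly):

```
# SSTables are immutable: syncing against the same bucket prefix every time
# means only SSTables created since the last backup actually get uploaded.
nodetool snapshot -t weekly my_keyspace
aws s3 sync /var/lib/cassandra/data/my_keyspace/ \
    s3://my-backup-bucket/$(hostname)/my_keyspace/ \
    --exclude '*' --include '*/snapshots/weekly/*'
nodetool clearsnapshot -t weekly
```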

It's not really a "hands-on" write-up, but it should let you know about
existing ways to do backups and the tradeoffs. I wrote this a year ago:
http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
.

It's a complex topic, I hope some of this is helpful to you.

C*heers,
-------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le jeu. 28 mars 2019 à 11:24, manish khandelwal <
manishkhandelwa...@gmail.com> a écrit :

> Hi
>
>
>
> I would like to know is there any guideline for selecting storage device
> (disk type) for Cassandra backups.
>
>
>
> As per my current observation, NearLine (NL) disk on SAN  slows down
> significantly while copying backup files (taking full backup) from all node
> simultaneously. Will using SSD disk on SAN help us in this regard?
>
> Apart from using SSD disk, what are the alternative approach to make my
> backup process fast?
>
> What are the best practices while designing backup storage system for big
> Cassandra cluster?
>
>
> Regards
>
> Manish
>


Re: Cassandra collection tombstones

2019-01-28 Thread Alain RODRIGUEZ
 place, that will probably make
things way nicer/easier.
The person in the post above used JSON as his way out of this problem.
In the past, I kept the collection but changed queries from 'insert' to
'update', as described here:
http://cassandra.apache.org/doc/4.0/cql/types.html#id5.
Using this form worked without creating tombstones, if I remember
correctly. I would guess this is because when you 'update' you accept that
the previously set values in the map remain unchanged if they are not
specified; that's why Cassandra doesn't have to 'clean' first. For inserts,
it would be weird to insert a map and find 2 columns from the previous
'insert', so Cassandra does not read: it cleans and writes on top.
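A small sketch of the difference, reusing the nmtest table from the quoted
question below (illustrative only):

```
# Overwriting the whole collection implicitly deletes it first
# (a hidden range tombstone):
cqlsh -e "INSERT INTO ks.nmtest (reservation_id, order_id, order_details)
          VALUES ('5', '5', {'key': 'value'});"

# Appending with UPDATE only writes the new entry, with no hidden delete:
cqlsh -e "UPDATE ks.nmtest
          SET order_details = order_details + {'key2': 'value2'}
          WHERE reservation_id = '5' AND order_id = '5';"
```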

Good luck,

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le dim. 27 janv. 2019 à 09:02, Ayub M  a écrit :

> Thanks Alain/Chris.
>
> Firstly I am not seeing any difference when using gc_grace_seconds with
> sstablemetadata.
>
> CREATE TABLE ks.nmtest (
> reservation_id text,
> order_id text,
> c1 int,
> order_details map<text, text>,
> PRIMARY KEY (reservation_id, order_id)
> ) WITH CLUSTERING ORDER BY (order_id ASC)
> AND bloom_filter_fp_chance = 0.01
> AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
> AND comment = ''
> AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
> AND compression = {'chunk_length_in_kb': '64', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
> AND crc_check_chance = 1.0
> AND dclocal_read_repair_chance = 0.1
> AND default_time_to_live = 0
> AND gc_grace_seconds = 86400
> AND max_index_interval = 2048
> AND memtable_flush_period_in_ms = 0
> AND min_index_interval = 128
> AND read_repair_chance = 0.0
> AND speculative_retry = '99PERCENTILE';
>
> [root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
> sstabledump mc-11-big-Data.db
> WARN  08:27:32,793 memtable_cleanup_threshold has been deprecated and
> should be removed from cassandra.yaml
> [
>   {
> "partition" : {
>   "key" : [ "4" ],
>   "position" : 0
> },
> "rows" : [
>   {
> "type" : "row",
> "position" : 40,
> "clustering" : [ "4" ],
> "cells" : [
>   { "name" : "order_details", "path" : [ "key1" ], "value" :
> "value1", "tstamp" : "2019-01-27T08:26:49.633240Z" }
> ]
>   }
> ]
>   },
>   {
> "partition" : {
>   "key" : [ "5" ],
>   "position" : 41
> },
> "rows" : [
>   {
> "type" : "row",
> "position" : 82,
> "clustering" : [ "5" ],
> "liveness_info" : { "tstamp" : "2019-01-27T08:23:29.782506Z" },
> "cells" : [
>   { "name" : "c1", "value" : 5 },
>   { "name" : "order_details", "deletion_info" : { "marked_deleted"
> : "2019-01-27T08:23:29.782505Z", "local_delete_time" :
> "2019-01-27T08:23:29Z" } },
>   { "name" : "order_details", "path" : [ "key" ], "value" :
> "value" }
> ]
>   }
> ]
>   }
>
> Partition 5 is a newly inserted record, no matter what gc_grace_seconds
> value I pass it still shows this record as estimated tombstone.
>
> [root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
> sstablemetadata mc-11-big-Data.db | grep "Estimated tombstone drop times"
> -A3
> Estimated tombstone drop times:
> 1548577440: 1
> Count   Row SizeCell Count
>
> [root@ip-xxx-xxx-xxx-xxx nmtest-e1302500201d11e983bb693c02c04c62]#
> sstablemetadata  --gc_grace_seconds 86400 mc-11-big-Data.db | grep
> "Estimated tombstone drop times" -A4
> Estimated tombstone drop times:
> 1548577440: 1
> Count   Row SizeCell Count
>
> Second question, for this test table I see the original record which I
> inserted got its tombstone removed by autocompaction which ran today as its
> gc_grace_seconds is set to one day. But I see some tables whose
> gc_grace_seconds is set to 3 days but when I do sstabledump on them I see

Re: Cassandra collection tombstones

2019-01-25 Thread Alain RODRIGUEZ
Hello,

I think you might be inserting on top of an existing collection;
implicitly, Cassandra creates a range tombstone. Cassandra does not
update/delete data in place, it always inserts (data or tombstone). Then eventually
compaction merges the data and evicts the tombstones. Thus, when overwriting
an entire collection, Cassandra performs a delete first under the hood.

I wrote about this, in this post about 2 years ago, in the middle of this
(long) article:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Here is the part that might be of interest in your case:

"Note: When using collections, range tombstones will be generated by INSERT
and UPDATE operations every time you are using an entire collection, and
not updating parts of it. Inserting a collection over an existing
collection, rather than appending it or updating only an item in it, leads
to range tombstones insert followed by the insert of the new values for the
collection. This DELETE operation is hidden leading to some weird and
frustrating tombstones issues."

and

"From the mailing list I found out that James Ravn posted about this topic
using list example, but it is true for all the collections, so I won’t go
through more details, I just wanted to point this out as it can be
surprising, see:
http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html#lists
"

Thus to specifically answer your questions:

 Does this tombstone ever get removed?


Yes, after gc_grace_seconds (a table option) has passed AND if the data that is
shadowed by the tombstone is also part of the same compaction (all the
previous fragments need to be there, if I remember correctly). So yes, but
eventually, not immediately nor any time soon (10+ days by default).


> Also when I run sstablemetadata on the only sstable, it shows "Estimated
> droppable tombstones" as 0.5", Similarly it shows one record with epoch
> time as insert time for - "Estimated tombstone drop times: 1548384720: 1".
> Does it mean that when I do sstablemetadata on a table having collections,
> the estimated droppable tombstone ratio and drop times values are not true
> and dependable values due to collection/list range tombstones?


I do not remember this precisely but you can check the code, it's worth
having a look. The "estimated droppable tombstones" value is actually always
off: it's an estimate that does not consider overlaps (and I'm
not sure it considers gc_grace_seconds either). Also, the
calculation does not count certain types of tombstones, and the
weight of range tombstones compared to tombstone cells makes the count
quite inaccurate:
http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html
.

I think this evolved since I looked at it and might not remember well, but
this value is definitely not accurate.

If you're re-inserting a collection for a given existing partition often,
there is probably plenty of tombstones sitting around though, that's almost
guaranteed.

Does tombstone_threshold of compaction depend on the sstablemetadata
> threshold value? If so then for tables having collections, this is not a
> true threshold right?
>

Yes, I believe the tombstone threshold actually uses the "estimated
droppable tombstones" value to choose whether or not to trigger a
"single-SSTable"/"tombstone" compaction. Yet, in your case, this will not
clean the tombstones in the first 10 days at least (gc_grace_seconds
default value). Compactions do not keep triggering because there is a
minimum interval defined between 2 tombstone compactions of an SSTable (1
day by default). This setting is most probably keeping you away from a useless
compaction loop, so I would not try to change it. Whether collections are
involved or not does not change how the compaction strategy operates.

I faced this in the past. Operationally you can have things working, but
it's hard and really pointless (it was in my case at least). I would
definitely recommend changing the model to update parts of the map and
never rewrite a map.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 25 janv. 2019 à 05:29, Ayub M  a écrit :

> I have created a table with a collection. Inserted a record and took
> sstabledump of it and seeing there is range tombstone for it in the
> sstable. Does this tombstone ever get removed? Also when I run
> sstablemetadata on the only sstable, it shows "Estimated droppable
> tombstones" as 0.5", Similarly it shows one record with epoch time as
> insert time for - "Estimated tombstone drop times: 1548384720: 1". Does it
> mean that when I do sstablemetadata on a table having collections, the
> estimated droppable tombstone ratio and drop tim

Re: question about the gain of increasing the number of vnode

2019-01-21 Thread Alain RODRIGUEZ
Sure, it's called "Cassandra Availability with Virtual Nodes”, by Joey
Lynch and Josh Snyder.

I found it in the mailing list archives:
https://github.com/jolynch/python_performance_toolkit/blob/master/notebooks/cassandra_availability/whitepaper/cassandra-availability-virtual.pdf

There are some maths in there to explain impacts of the number of vnodes on
availability.

Using the formula "1d", and considering a datacenter of 3 balanced racks
with RF = 3, we have:

Np * (1 - (1 - 1/Np)^(v*2*(R-1))) = 40 * (1 - (1 - 1/40)^(256*2*(3-1))) = 39.98
Thus if my calculation is accurate, with 60 nodes and 256 vnodes, we expect
a node to have 39.98 neighbors. This means that with 60 nodes, *each
node* has 40 *possible* replicas (all the nodes in other racks) and will be
sharing a token range with all the other nodes. Thus 2 nodes down in
distinct racks and you have an outage almost ensured (still needs 2 nodes
down).

Some other arbitrary numbers that show the evolution of this value
depending on the number of nodes and vnodes.

- With 60 nodes and 256 vnodes, expect 39.98 neighbors
- With 60 nodes and 16 vnodes, expect 32.0867407145 neighbors
- With 60 nodes and 4 vnodes, expect 13.323193263 neighbors

- With 300 nodes and 256 vnodes, expect 198.8200470802 neighbors
- With 300 nodes and 16 vnodes, expect 54.8867183963 neighbors
- With 300 nodes and 4 vnodes, expect 15.4137752052 neighbors
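If you want to reproduce these numbers, here is a quick sketch, assuming 3 balanced
racks and RF = 3, so that Np (the number of possible replica nodes for a given node)
is 2/3 of the cluster size:

```
for nodes in 60 300; do
  for v in 4 16 256; do
    awk -v n="$nodes" -v v="$v" 'BEGIN {
      Np = n * 2 / 3; R = 3
      # Np * (1 - (1 - 1/Np)^(v*2*(R-1)))
      printf "%d nodes, %3d vnodes -> %.2f expected neighbors\n", n, v, Np * (1 - (1 - 1/Np) ^ (v * 2 * (R - 1)))
    }'
  done
done
```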


Good reading :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 21 janv. 2019 à 13:30, VICTOR IBARRA  a écrit :

> Hi Alain ,
>
>  thank you very much for the explication and the points for the sujet of
> managing de vnodes
>
> you talk about the paper of netflix and the outage ?  you have the link
> with this discution
>
> thank you for your help
> BEST REGARDS
>
> Le lun. 21 janv. 2019 à 13:53, Alain RODRIGUEZ  a
> écrit :
>
>> There have been some discussion on this topic in this mailing list,
>> including a paper from Netflix with the impact of vnodes. I could not find
>> it quickly, but I invite you to check.
>>
>> To share some ideas:
>>
>> More vnodes:
>> + Better balance between nodes
>> + Maximize the streaming throughput for operations, as all nodes share a
>> small bit of the data of all the other nodes (according to the topology).
>> - When nodes fail, there is more chance to lose availability: with
>> 256 vnodes for example, 2 nodes down in distinct racks would for sure make
>> data partially unavailable.
>> - Overheads / Operational issues (in practice, using 256 vnodes has been
>> a nightmare for multiple reasons, see below)
>>
>>
>> Less vnodes
>> - Imbalances can be big before C* 3.0. After, using
>> allocate_tokens_for_keyspace -->
>> http://cassandra.apache.org/doc/4.0/configuration/cassandra_config_file.html#allocate-tokens-for-keyspace,
>> you can mitigate this issue. With this and some techniques*, you can have
>> good results in terms of balance.
>> * Off the top of my head: this involves bootstrapping the seeds first,
>> picking the tokens to use, creating your keyspace, then adding nodes with the
>> option above. You can test it quite easily. Then with "nodetool status
>> > - The streaming throughput is generally limited by the receiving host
>> when using vnodes, thus 16 vnodes is probably not worse than 256 in terms
>> of streaming
>> + The other way around, the overhead of having 256 vnodes makes
>> operations such as repair almost impossible, or at least way longer and
>> more complex. Repairing almost-empty tables can take minutes and repairing
>> a big dataset might never end.
>> + In Netflix paper about this topic (very interesting, I recommend
>> reading), it is explained that reducing the number of vnodes reduces the
>> chances of an outage.
>> + There was a discussion in the dev mailing list. I believe the community
>> agreed on the need to reduce the number of vnodes by default. Here again,
>> you can have a quick look at the archive, Jira, github/trunk.
>>
>> I think that commonly accepted values would be 16/32. Values as low as 4
>> are considered to improve availability and reduce the overheads induced by vnodes.
>> I would suggest you test it and see if, with low values, you still manage to keep
>> the balance between nodes.
>>
>> Also using "physical" nodes (initial_token, no vnodes) gives the
>> possibility to reason about token distribution. You can perform advanced 
>> operations
>> where you bootstrap 1/3 of the cluster at once. This is very good
>> especially for big clusters, I woul

Re: question about the gain of increasing the number of vnode

2019-01-21 Thread Alain RODRIGUEZ
There have been some discussion on this topic in this mailing list,
including a paper from Netflix with the impact of vnodes. I could not find
it quickly, but I invite you to check.

To share some ideas:

More vnodes:
+ Better balance between nodes
+ Maximize the streaming throughput for operations, as all nodes share a
small bit of the data of all the other nodes (according to the topology).
- When nodes fail, there is more chance to lose availability: with
256 vnodes for example, 2 nodes down in distinct racks would for sure make
data partially unavailable.
- Overheads / Operational issues (in practice, using 256 vnodes has been a
nightmare for multiple reasons, see below)


Less vnodes
- Imbalances can be big before C* 3.0. After, using
allocate_tokens_for_keyspace -->
http://cassandra.apache.org/doc/4.0/configuration/cassandra_config_file.html#allocate-tokens-for-keyspace,
you can mitigate this issue. With this and some techniques*, you can have
good results in terms of balance.
* Off the top of my head: this involves bootstrapping the seeds first,
picking the tokens to use, creating your keyspace, then adding nodes with the
option above. You can test it quite easily. Then with "nodetool status
http://www.thelastpickle.com


Le lun. 21 janv. 2019 à 11:08, VICTOR IBARRA  a écrit :

>
> Good morning every one,
>
> I would like have a contact with the cassandra community for the questions
> of cluster configuration
>
> Today i have many questions and differents projets about the configuration
> of cluster cassandra and with the general problems of configuration
> migration and for the use of vnodes.
>
> and the principal question is what about the gain to use 256 vnodes vs 16
> vnodes for example
>
> Best regards
> --
>  L'integrité de ce message n'étant pas assurée sur internet, VICTOR IBARRA
> ne peut être tenue responsable de son contenu en ce compris les pièces
> jointes. Toute utilisation ou diffusion non autorisée est interdite. Si
> vous n'êtes pas destinataire de ce message, merci de le  détruire et
> d'avertir l'expéditeur.
>
>  The integrity of this message cannot be guaranteed on the Internet.
> VICTOR IBARRA can not therefore be considered liable for the  contents
> including its attachments. Any unauthorized use or dissemination is
> prohibited. If you are not the intended recipient of  this message, then
> please delete it and notify the sender.
>


Re: How can I limit the non-heap memory for Cassandra

2019-01-18 Thread Alain RODRIGUEZ
Hello Chris,

I must admit I am a bit confused about what you need exactly, I'll try to
do my best :).


> would like to place limits on it to avoid it becoming a “noisy neighbor”

But we also don’t want it killed by the oom killer, so just placing limits
> on the container won't help.


This sounds contradictory to me. When the available memory is fully used and
memory cannot be taken from elsewhere, the program cannot continue and an OOM
cannot be avoided. So it's probably one thing or the other, I would say
¯\_(ツ)_/¯.

Is there’s a way to limit Cassandra’s off-heap memory usage?


If we consider this perspective: "containerMemory = NativeMemory +
HeapMemory",
then by controlling the JVM heap memory and the container memory, you also
control the off-heap/native memory. So practically yes, you can set the
off-heap memory size, not directly, but by reducing the JVM heap size.
The option is in jvm.options (or cassandra-env.sh): MAX_HEAP_SIZE="X" (it
comes with some tradeoffs as well of course; if you're going down this path I
recommend this post from Jon about JVM/garbage collection, which is a tricky
piece of Cassandra operations:
http://thelastpickle.com/blog/2018/04/11/gc-tuning.html)
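As a minimal sketch, assuming an 8 GB memory limit on the container (the numbers
are purely illustrative), the heap cap in cassandra-env.sh could look like this,
leaving the rest for off-heap structures and page cache:

```
# in cassandra-env.sh (or the equivalent jvm.options settings, depending on the version);
# if you set MAX_HEAP_SIZE you should also set HEAP_NEWSIZE
MAX_HEAP_SIZE="4G"
HEAP_NEWSIZE="800M"
```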

You can also control each (most?) of the off-heap structures individually.
It's a bit split here and there between the distinct configuration files
and at the table level.
For example, if you're running out of Native Memory, you can maybe:

- Consider adding RAM or use a bigger instance type in the cloud.
- Reduce bloom filters ? -->
http://cassandra.apache.org/doc/latest/operating/bloom_filters.html?highlight=bloom_filter_fp_chance#bloom-filters
- Disable Row caches ? If you have troubles with memory, I would start
there probably (You did not give us your version of Cassandra though).
- Reduce the max_index_interval? -->
https://docs.datastax.com/en/cql/3.3/cql/cql_reference/cqlCreateTable.html#tabProp__cqlTableMax_index_interval
- ...
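To illustrate a couple of the per-table knobs above (keyspace, table names and
values are only examples of mine, and each change has tradeoffs, so test before
applying it widely):

```
cqlsh -e "ALTER TABLE my_ks.my_table WITH bloom_filter_fp_chance = 0.1;"
cqlsh -e "ALTER TABLE my_ks.my_table WITH max_index_interval = 4096 AND min_index_interval = 256;"
# the row cache itself is sized globally: row_cache_size_in_mb: 0 in cassandra.yaml disables it
```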

*Long Story :)*

It's the C* operator's job to ensure that the hardware choice and usage is
optimal or at least that the sum of the memory needed by off-heap
structures stays below what's available and does not produce any OOM. Some
structures are capped (like the key cache size) some other will grow with
the data (like bloom filters). Thus it's good to have some breathing room
for growth and to have a monitoring system in place (this is something I
advocate for at any occasion :D). Finding the right balance is part of the
job of many of us here around :).

That being said, it's rare we are fighting this kind of OOM because, in a
huge majority of clusters, we rely strongly on page caching and we try
to have as much possible 'free' native memory for that purpose. We run into
problems way before running out of native memory in many cases.

Generally, a Cassandra cluster with the recommended (64 GB of RAM maybe?)
or at least decent (32 GB?) hardware and the default configuration should
hopefully work nicely. The schema design might make things worse and of
course, you can optimize and reduce the cost, sometimes in a
substantial way. But globally Cassandra and the default configuration give
a good starting point I think.

One last thing is that the more details you share, the sharper and more accurate
our answers can be. I feel like I told you *everything* I know about memory
because it wasn't clear to me what you needed :). Specifying the Cassandra
version, and telling something about your specific case like the memory
total size or the JVM configuration would probably induce (faster/better)
responses from us :).

I hope this still helps.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com





Le jeu. 3 janv. 2019 à 00:03, Chris Mildebrandt  a
écrit :

> Hi,
>
> Is there’s a way to limit Cassandra’s off-heap memory usage? I can’t find
> a way to limit the memory used for row caches, bloom filters, etc. We’re
> running Cassandra in a container and would like to place limits on it to
> avoid it becoming a “noisy neighbor”. But we also don’t want it killed by
> the oom killer, so just placing limits on the container won't help.
>
> Thanks,
> -Chris
>


Re: Cassandra 2.1 bootstrap - No streaming progress from one node

2018-11-26 Thread Alain RODRIGUEZ
Hello,

+1 with Sean above.
In Cassandra 2.2 you got the new 'nodetool bootstrap resume' command to resume a
bootstrap. For C*2.1, there is nothing equivalent, sadly. Thus old school techniques
apply, and the possible alternatives are:

Option 1 - safe and slow - Stop Cassandra on the stuck joining node.
Remove everything (commitlog/data), restart the bootstrap, as mentioned by Sean
above.
Option 2 - More wild / hopefully quicker - Stop the joining node and start
it with 'auto_bootstrap: false'. The node joins with missing data, which you
can repair afterwards. Yet this can be done without inconsistencies only when using
strong consistency (CL.R + CL.W > RF): this node will be read from, but on its own
it is not enough to induce a stale read.

Unless you're confident, or have a huge amount of data and most of it made
it to the new node already, I would stick with option 1, safe and slow
(and upgrade soon ;-))
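For option 1, a rough sketch on the stuck joining node could look like this,
assuming the default data locations and init scripts (adapt paths and service
commands to your setup):

```
sudo service cassandra stop
sudo rm -rf /var/lib/cassandra/data/* /var/lib/cassandra/commitlog/* /var/lib/cassandra/saved_caches/*
sudo service cassandra start   # the node starts a fresh bootstrap
nodetool netstats              # follow the streaming progress again
```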

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mer. 7 nov. 2018 à 19:22, Durity, Sean R  a
écrit :

> I would wipe the new node and bootstrap again. I do not know of any way to
> resume the streaming that was previously in progress.
>
>
>
>
>
> Sean Durity
>
> *From:* Steinmaurer, Thomas 
> *Sent:* Wednesday, November 07, 2018 5:13 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Cassandra 2.1 bootstrap - No streaming progress
> from one node
>
>
>
> Hello,
>
>
>
> while bootstrapping a new node into an existing cluster, a node which is
> acting as source for streaming got restarted unfortunately. Since then,
> from nodetool netstats I don’t see any progress for this particular node
> anymore.
>
>
>
> E.g.:
>
>
>
> /X.X.X.X
>
> Receiving 94 files, 260.09 GB total. Already received 26 files,
> 69.33 GB total
>
>
>
> Basically, it is stuck at 69.33GB for hours. Is Cassandra (2.1 in our
> case) not doing any resume here, in case there have been e.g. connectivity
> troubles or in our case, Cassandra on the node acting as stream source got
> restarted?
>
>
>
> Can I force the joining node to recover connection to X.X.X.X or do I need
> to restart the bootstrap via restart on the new node from scratch?
>
>
>
> Thanks,
>
> Thomas
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH (registration
> number FN 91482h) is a company registered in Linz whose registered office
> is at 4040 Linz, Austria, Freistädterstraße 313
>
> --
>
> The information in this Internet Email is confidential and may be legally
> privileged. It is intended solely for the addressee. Access to this Email
> by anyone else is unauthorized. If you are not the intended recipient, any
> disclosure, copying, distribution or any action taken or omitted to be
> taken in reliance on it, is prohibited and may be unlawful. When addressed
> to our clients any opinions or advice contained in this Email are subject
> to the terms and conditions expressed in any applicable governing The Home
> Depot terms of business or client engagement letter. The Home Depot
> disclaims all responsibility and liability for the accuracy and content of
> this attachment and for any damages or losses arising from any
> inaccuracies, errors, viruses, e.g., worms, trojan horses, etc., or other
> items of a destructive nature, which may be contained in this attachment
> and shall not be liable for direct, indirect, consequential or special
> damages in connection with this e-mail message or its attachment.
>


Re: Exception when running sstableloader

2018-11-26 Thread Alain RODRIGUEZ
Hello LAD,

I do not know much about the SSTable Loader. I carefully stayed away from
it so far :). But it seems it's using thrift to talk to Cassandra.

Some of your rows might be too big and increasing
'thrift_framed_transport_size_in_mb' should have helped indeed.

Did you / would you try increasing this one as well:
'thrift_max_message_length_in_mb', and see what happens?
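As a hedged sketch of what that could look like on the target cluster's nodes (the
values below are only illustrative; keep the max message length at least as big as
the frame size, and restart the nodes before retrying sstableloader):

```
grep -E '^thrift_(framed_transport_size|max_message_length)_in_mb' /etc/cassandra/cassandra.yaml
# thrift_framed_transport_size_in_mb: 40
# thrift_max_message_length_in_mb: 48
```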

Cheers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le lun. 5 nov. 2018 à 18:00, Kalyan Chakravarthy  a
écrit :

> I’m trying to migrate data between two clusters on different networks.
> Ports: 7001,7199,9046,9160 are open between them. But port:7000 is not
> open. When I run sstableloader command, got the following exception.
> Command:
>
> :/a/cassandra/bin# ./sstableloader -d
> 192.168.98.99/abc/cassandra/data/apps/ads-0fdd9ff0a7d711e89107ff9c3da22254
>
> Error/Exception:
>
> Could not retrieve endpoint ranges:
> org.apache.thrift.transport.TTransportException: *Frame size (352518912)
> larger than max length (15728640)!*
> java.lang.RuntimeException: Could not retrieve endpoint ranges:
> at
> org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:342)
> at
> org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:156)
> at org.apache.cassandra.tools.BulkLoader.main(BulkLoader.java:109)
> Caused by: org.apache.thrift.transport.TTransportException: Frame size
> (352518912) larger than max length (15728640)!
> at
> org.apache.thrift.transport.TFramedTransport.readFrame(TFramedTransport.java:137)
> at
> org.apache.thrift.transport.TFramedTransport.read(TFramedTransport.java:101)
> at
> org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
> at
> org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
> at
> org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
> at
> org.apache.cassandra.thrift.Cassandra$Client.recv_describe_partitioner(Cassandra.java:1368)
> at
> org.apache.cassandra.thrift.Cassandra$Client.describe_partitioner(Cassandra.java:1356)
> at
> org.apache.cassandra.tools.BulkLoader$ExternalClient.init(BulkLoader.java:304)
> ... 2 more
>
>
>
>
> In yaml file,’ thrift_framed_transport_size_in_mb:’ is set to 15. So I
> have increased its value to 40. Even after increasing the
> ‘thrift_framed_transport_size_in_mb: ‘ in yaml file, I’m getting the same
> error.
>
> What could be the solution for this. Can somebody please help me with
> this??
>
> Cheers
> LAD
>


Re: Problem with hints

2018-11-26 Thread Alain RODRIGUEZ
Hello,

It's hard to say what the best configuration is.

*max_hints_delivery_threads*: I think it's said you should use 2 delivery
threads per DC (as a starting point).
*max_hint_window_in_ms:* depends on your need (and what the cluster can
bear of course). Generally, the default of 3 hours is used.
*hinted_handoff_throttle_in_kb:* this one is a trickier one. You want to
reduce it for hints delivery not to be harmful. See the description here:
http://cassandra.apache.org/doc/4.0/configuration/cassandra_config_file.html?highlight=max_hints_delivery_threads#hinted-handoff-throttle-in-kb

*Notes:*
- if the cluster crashed, it might not be directly due to hints, but to the
pressure on the remaining nodes, which is too big for them to cope with.
- C*3+ (off the top of my head) is storing hints in files rather than using
a system table. This should be way more reliable. If you're still using a
version of Cassandra storing hints in the system table, you might want to
upgrade?
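As an illustrative starting point only (2 DCs here; the numbers must be tuned and
validated against your own cluster and version), the three settings above live in
cassandra.yaml:

```
grep -E '^(max_hints_delivery_threads|max_hint_window_in_ms|hinted_handoff_throttle_in_kb)' /etc/cassandra/cassandra.yaml
# max_hints_delivery_threads: 4        # ~2 delivery threads per DC
# max_hint_window_in_ms: 10800000      # 3 hours, the default
# hinted_handoff_throttle_in_kb: 512   # lower it if hint delivery hurts the receiving nodes
```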

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 1 nov. 2018 à 16:22, Ghazi Naceur 
a écrit :

> Hello everyone,
>
> I have a problem with the hints in my cluster.
> In fact, my cluster is composed of 5 nodes : 3 nodes in DC1 and 2 nodes in
> DC2.
> The size of the hints folder increases rapidly and it reaches nearly the
> 90% of the hard drive (50 Gb).
> The test scenario is the following : I applied 1000 UPDATE requests per
> second on the first DC. I wanted to make this test lasts 3 hours. The first
> node of the second DC stopped, and after few minutes, the second node
> stopped as well. The hints files started to increase rapidly and consumed a
> big part of my hard drive.
> This test fails after few minutes.
> The replication factor is 2 for both DCs and the consistency level is
> LOCAL_ONE.
> Could you please tell me what is the best configuration of the hints
> properties : hinted_handoff_throttle_in_kb, max_hint_window_in_ms and
> max_hints_delivery_threads ?
>
> Best regards.
>


Re: View slow queries in V3

2018-11-26 Thread Alain RODRIGUEZ
Hello,


I just saw this message still has no response.

Do we have any inbuilt features to log slow\resource heavy queries?


I think we don't have this in Cassandra yet. I also believe it's in DSE, if
it's something you'd really need :).

I tried to check data in system_traces.sessions to check current running
> sessions but even that does not have any data.
>

This is expected. Those tables ('sessions' and 'events') are empty unless
you run: `nodetool settraceprobability 0.001` or similar.
Here you say: set the probability to actually trace a query to 0.1%.

This is off by default and can do a lot of damage if set to a high value, as
each query being traced will do more writes in Cassandra. Move up
incrementally with tracing until you get enough data, representative enough
of the problems you're facing.
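For example (the probability and the query below are just a sketch of mine; switch
tracing back off once you have what you need):

```
nodetool settraceprobability 0.001
# let it run for a while, then look at the slowest traced sessions
cqlsh -e "SELECT session_id, duration, started_at, request FROM system_traces.sessions LIMIT 20;"
nodetool settraceprobability 0
```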
That being said, it's very rare I need to trace queries to understand
what's wrong with the cluster. Generally, I  find a lot of useful
information in one of those when I'm facing issues:
- `nodetool tpstats` - Any pending / dropped requests?
- `grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log` - Any error or
warning explaining faced issues? Check here if:
--  tombstones might be an issue
-- Large partitions are being compacted (how big?)
- `nodetool (proxy)histograms` to see the latencies, number of
sstables touched per read and more stuff.


I'm asking this because I can see 2 nodes in my cluster going Out Of Memory
> multiple times but we're not able to find the reason for it. I'm suspecting
> that it is a heavy read query but can't find any logs for it.


Ah. Is the Out of Memory inside the heap or in the native memory?

If it's a heap issue, tuning GC is probably the way to go and this
information might help:
- http://thelastpickle.com/blog/2018/04/11/gc-tuning.html
- https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

If it's out of the heap, you're probably not leaving enough memory to
Cassandra, or you are using a C* version with a memory leak (very early 3.11 had
this, I believe). Some options
(all come with distinct consequences/tradeoffs of course):
- Reduce the heap size?
- Reduce the sizes of indexes by increasing the `min/max_index_interval` of
the biggest tables?
- Use more memory?
- Reduce `bloom_filter_fp_chance` ?

Finally, I'm a big fan of monitoring. With the proper monitoring system
AND dashboards in place, you would probably see what's wrong at first
sight. Then understanding it or fixing it can take some time, but
monitoring makes it really easy to see that 'something' is wrong, a spike
somewhere in some chart, and to start digging from there. Many providers are
now offering dashboards out of the box for Cassandra (I worked on the Datadog
ones, but other tools have them, such as Sematext SPM). Also, on the open source
side, Criteo released Cassandra monitoring systems working on
top of Prometheus. People also use Grafana/Graphite and other standard
tools. You might find nice dashboards there too.

I hope this will help (if it's not too late :)).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 1 nov. 2018 à 15:18, Dipan Shah  a écrit :

> Hi All,
>
> Do we have any inbuilt features to log slow\resource heavy queries?
>
> I tried to check data in system_traces.sessions to check current running
> sessions but even that does not have any data.
>
> I'm asking this because I can see 2 nodes in my cluster going Out Of
> Memory multiple times but we're not able to find the reason for it. I'm
> suspecting that it is a heavy read query but can't find any logs for it.
>
> Thanks,
> Dipan
>


Re: snapshot strategy?

2018-11-21 Thread Alain RODRIGUEZ
Hi Lou,

My apologies for the delay answering here, and for my misunderstanding in
the first place.

we want to keep snapshots that are no older than X days old while purging
> the older ones
>

I think that you can achieve this with some code, outside of Cassandra or
interacting with Cassandra, using bash, python or whatever you like and
whatever works for you :).

(a) those created on demand by given name and (b) those create
> automatically, for example as a result of a TRUNCATE, that do not have a
> well known name. To get rid of the given name ones (a) seems straight
> forward. How do I locate and get rid of the automatically created  (b) ones?
>

Snapshots are available in 'data/tablename-cfid/snapshots/'. You can get rid of
the snapshots (b), and make sure to keep snapshots (a) (if you want to), by
checking the date of creation of each directory (ie snapshot) in there.
Then check the name 'format', maybe with a regex, or use a prefix for
'manual' snapshots (a)? Whatever works well for you again.
Finally, run 'nodetool clearsnapshot -t <snapshot_name>' (you can also
specify the keyspace and table here if needed)

What I try to say is that I would probably try to solve this issue outside
of Cassandra, only invoking nodetool.

Another option: I think you can just remove the old snapshot directories with 'rm -r
/path_to_data/tablename-cfid/snapshots/<snapshot_name>'. You can probably do
that using something like
https://stackoverflow.com/questions/13868821/shell-script-to-delete-directories-older-than-n-days
.
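A rough sketch of that, assuming default data paths, GNU find, and that the
snapshots you want to keep share a known prefix ('manual-' here is purely an
example of mine); anything older than 7 days is cleared through nodetool:

```
find /var/lib/cassandra/data/*/*/snapshots/ -mindepth 1 -maxdepth 1 -type d \
     -mtime +7 ! -name 'manual-*' -printf '%f\n' | sort -u |
while read -r snap; do
  nodetool clearsnapshot -t "$snap"
done
```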

I hope I answered closer to your expectations this time :)

C*heers,
-------
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le lun. 5 nov. 2018 à 21:09, Lou DeGenaro  a écrit :

> Alain,
>
> Thanks for the suggestion, but I think I did not make myself clear.  In
> order to utilize disk space efficiently, we want to keep snapshots that are
> no older than X days old while purging the older ones.   My understanding
> is that there are 2 kinds of snapshots :  (a) those created on demand by
> given name and (b) those create automatically, for example as a result of a
> TRUNCATE, that do not have a well known name. To get rid of the given name
> ones (a) seems straight forward.  How do I locate and get rid of the
> automatically created  (b) ones?
>
> Or if I am under some misconception, I'd be happily educated.
>
> Thanks.
>
> Lou.
>
> On Mon, Nov 5, 2018 at 3:49 PM Alain RODRIGUEZ  wrote:
>
>> Hello Lou,
>>
>> how do you clear the automatic ones (e.g. names unknown) without clearing
>>> the named ones?
>>>
>>
>> The option '-t' might be what you are looking for: 'nodetool
>> clearsnapshot -t nameOfMySnapshot'.
>>
>> From the documentation here:
>> http://cassandra.apache.org/doc/latest/tools/nodetool/clearsnapshot.html?highlight=clearsnapshot
>>
>> Le lun. 5 nov. 2018 à 13:38, Lou DeGenaro  a
>> écrit :
>>
>>> The issue really is how to manage disk space.  It is certainly possible
>>> to take snapshots by name and delete them by name, perhaps one for each day
>>> of the week.  But how do you clear the automatic ones (e.g. names unknown)
>>> without clearing the named ones?
>>>
>>> Thanks.
>>>
>>> Lou.
>>>
>>> On Fri, Nov 2, 2018 at 12:28 PM Oleksandr Shulgin <
>>> oleksandr.shul...@zalando.de> wrote:
>>>
>>>> On Fri, Nov 2, 2018 at 5:15 PM Lou DeGenaro 
>>>> wrote:
>>>>
>>>>> I'm looking to hear how others are coping with snapshots.
>>>>>
>>>>> According to the doc:
>>>>> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupDeleteSnapshot.html
>>>>>
>>>>> *When taking a snapshot, previous snapshot files are not automatically
>>>>> deleted. You should remove old snapshots that are no longer needed.*
>>>>>
>>>>> *The nodetool clearsnapshot
>>>>> <https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsClearSnapShot.html>
>>>>> command removes all existing snapshot files from the snapshot directory of
>>>>> each keyspace. You should make it part of your back-up process to clear 
>>>>> old
>>>>> snapshots before taking a new one.*
>>>>>
>>>>> But if you delete first, then there is a window of time when no
>>>>> snapshot exists until the new one is created.  And with a single snapshot
>>>>> there is no recovery further back than it.
>>>>>
>>>> You can also delete specific snapshot, by passing its name to the
>>>> clearsnapshot command.  For example, you could use snapshot date as part of
>>>> the name.  This will also prevent removing snapshots which were taken for
>>>> reasons other than backup, like the automatic snapshot due to running
>>>> TRUNCATE or DROP commands, or any other snapshots which might have been
>>>> created manually by the operators.
>>>>
>>>> Regards,
>>>> --
>>>> Alex
>>>>
>>>>


Re: snapshot strategy?

2018-11-05 Thread Alain RODRIGUEZ
Hello Lou,

how do you clear the automatic ones (e.g. names unknown) without clearing
> the named ones?
>

The option '-t' might be what you are looking for: 'nodetool clearsnapshot
-t nameOfMySnapshot'.

From the documentation here:
http://cassandra.apache.org/doc/latest/tools/nodetool/clearsnapshot.html?highlight=clearsnapshot

Le lun. 5 nov. 2018 à 13:38, Lou DeGenaro  a écrit :

> The issue really is how to manage disk space.  It is certainly possible to
> take snapshots by name and delete them by name, perhaps one for each day of
> the week.  But how do you clear the automatic ones (e.g. names unknown)
> without clearing the named ones?
>
> Thanks.
>
> Lou.
>
> On Fri, Nov 2, 2018 at 12:28 PM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Fri, Nov 2, 2018 at 5:15 PM Lou DeGenaro 
>> wrote:
>>
>>> I'm looking to hear how others are coping with snapshots.
>>>
>>> According to the doc:
>>> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsBackupDeleteSnapshot.html
>>>
>>> *When taking a snapshot, previous snapshot files are not automatically
>>> deleted. You should remove old snapshots that are no longer needed.*
>>>
>>> *The nodetool clearsnapshot
>>> 
>>> command removes all existing snapshot files from the snapshot directory of
>>> each keyspace. You should make it part of your back-up process to clear old
>>> snapshots before taking a new one.*
>>>
>>> But if you delete first, then there is a window of time when no snapshot
>>> exists until the new one is created.  And with a single snapshot there is
>>> no recovery further back than it.
>>>
>> You can also delete specific snapshot, by passing its name to the
>> clearsnapshot command.  For example, you could use snapshot date as part of
>> the name.  This will also prevent removing snapshots which were taken for
>> reasons other than backup, like the automatic snapshot due to running
>> TRUNCATE or DROP commands, or any other snapshots which might have been
>> created manually by the operators.
>>
>> Regards,
>> --
>> Alex
>>
>>


Re: Scrub - autocompaction

2018-10-29 Thread Alain RODRIGUEZ
Hello,

should autocompaction be disabled before running scrub?
>

I would say the other way around: to be sure to leave some room to regular
compactions, you could try to run scrub with the following option:

`nodetool scrub -j 1` (-j / --jobs allows controlling the number of
compactor threads to use). Depending on the rhythm you want to give to
compaction operations, you can tune this number and the
'concurrent_compactors' option.
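For example, going keyspace by keyspace (or even table by table) keeps the
footprint small and easy to follow (the names below are placeholders):

```
nodetool scrub -j 1 my_keyspace my_table
nodetool compactionstats    # keep an eye on pending compactions while it runs
```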

It seems that scrub processes each new created table and never ends.


Depending on the command that was run, if nothing was specified, nodetool
scrub might be running on the whole node (or keyspace?), which can be a
lot, so when an SSTable finishes, another might well start, for quite a
while. But you should reach an end, as the scrub is applied to a list of
sstables calculated at scrub start (I believe).

Here is what I can share from my understanding. If you believe there might
be an issue with your version of Cassandra or want to make sure that
'nodetool scrub' behaves as described, reading the code and/or observing the
write time of the files that are getting compacted (SCRUB) is a way to go.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le lun. 29 oct. 2018 à 16:18, Vlad  a écrit :

> Hi,
>
> should autocompaction be disabled before running scrub?
> It seems that scrub processes each new created table and never ends.
>
> Thanks.
>


Re: nodetool status and node maintenance

2018-10-26 Thread Alain RODRIGUEZ
Hello

Any way to temporarily make the node under maintenance invisible  from
> "nodetool status" output?
>

I don't think so.
I would use a different approach, for example only warn/email when the
node has been down for 30 seconds or a minute, depending on how long it takes for
your nodes to restart. This way the failure is not invisible, but it is ignored
when you are only bouncing the nodes.
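A hypothetical sketch of that idea (the 60 second delay and the alert address are
made up, adjust them to your restart times and alerting tooling):

```
down_before=$(nodetool status | awk '/^DN/ {print $2}' | sort)
sleep 60
down_after=$(nodetool status | awk '/^DN/ {print $2}' | sort)
# only alert for nodes that were already down a minute ago and still are
comm -12 <(echo "$down_before") <(echo "$down_after") | while read -r ip; do
  [ -n "$ip" ] || continue
  echo "Cassandra node $ip down for more than 60s" | mail -s "Cassandra alert" ops@example.com
done
```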

As a side note, be aware that 'nodetool status' only gives a view of the
cluster from a specific node, which can be completely wrong as well :).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 26 oct. 2018 à 15:16, Saha, Sushanta K <
sushanta.s...@verizonwireless.com> a écrit :

> I have script that parses "nodetool status" output and emails alerts if
> any node is down. So, when I stop cassandra on a node for maintenance, all
> nodes stats emailing alarms.
>
> Any way to temporarily make the node under maintenance invisible  from
> "nodetool status" output?
>
> Thanks
>
>


Re: Insert from Select - CQL

2018-10-25 Thread Alain RODRIGUEZ
> Does anyone have any ideas of what I can do to generate inserts based on
> primary key numbers in an excel spreadsheet?


A quick thought:

What about using a column of the spreadsheet to actually store the SELECT
result and generate the INSERT statement (and I would probably do the
DELETE too) corresponding to each row using the power of the spreadsheet to
write the query once and have it for all the partitions with the proper
values?

The spreadsheet would then be your backup somehow.

We are a bit far from any Cassandra advice, but that's my first thought on
your problem, use the spreadsheet :).
Another option is probably to SELECT these rows and INSERT them into some
other Cassandra table (same cluster or not). Here you would have to code it,
I think (a client app of any kind).
This might not be a good fit, but just in case, you might want to check the
'COPY' statement:
https://stackoverflow.com/questions/21363046/how-to-select-data-from-a-table-and-insert-into-another-table
I'm not too sure what suits you best.
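As a hedged illustration of the COPY route (table and file names below are made up;
COPY exports the whole table, so you would either filter the CSV afterwards or first
copy the 4K rows into a scratch table and export that one):

```
cqlsh -e "COPY my_ks.my_table TO '/tmp/my_table_backup.csv' WITH HEADER = true;"
# if the customer got the numbers wrong, re-import what is needed:
cqlsh -e "COPY my_ks.my_table FROM '/tmp/my_table_backup.csv' WITH HEADER = true;"
```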

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mer. 24 oct. 2018 à 12:46, Philip Ó Condúin  a
écrit :

> Hi All,
>
> I have a problem that I'm trying to work out and can't find anything
> online that may help me.
>
> I have been asked to delete 4K records from a Column Family that has a
> total of 1.8 million rows.  I have been given an excel spreadsheet with a
> list of the 4K PRIMARY KEY numbers to be deleted.  Great, the delete will
> be easy anyway.
>
> But before I delete them I want to take a backup of what I'm deleting
> before I do, so that if the customer comes along and says they got the
> wrong numbers then I can quickly restore one or all of them.
> I have been trying to figure out how I can generate inserts from a select
> but it looks like this is not possible.
>
> I'm using centos and Cassandra 2.11
>
> Does anyone have any ideas of what I can do to generate inserts based on
> primary key numbers in an excel spreadsheet?
>
> Kind Regards,
> Phil
>
>
>


Re: Cleanup cluster after expansion?

2018-10-25 Thread Alain RODRIGUEZ
Hello,

'*nodetool cleanup*' used to be mono-threaded (up to C*2.1), then used all
the cores (C*2.1 - C*2.1.14), and is now something that can be controlled
(C*2.1.14+):
'*nodetool cleanup -j 2*' for example would use 2 compactors maximum (out
of the number of concurrent_compactors you defined, probably no more than
8).

*Global*: My advice would be to run it on all nodes with 1 or 2 threads
(never more than half of what's available). The impact of the cleanup
should not be bigger than the impact of a compaction. Also, be sure to
leave some room for regular compactions. This way, the cleanup should be
rather safe and it should be acceptable to run it in parallel in most
cases. In parallel you will save time to move to other operations quickly,
but generally there is no rush to run cleanup per se. So it's up to you to
run it in parallel or not. I often did, fwiw.

*Early Cassandra 2.1:* If you're using a Cassandra version between 2.1 and
2.1.14, I would go 1 node at a time, as you cannot really control the
number of threads. This operation in early C*2.1 is risky and heavy;
upgrade if you can, then cleanup, would be my advice here :). Be careful
there if you decide to go for the cleanup anyway. Mostly monitor pending
compactions stacking up and the disk space used. In the worst case you want to
have 50% of the disk free before starting cleanups.

*Note: *Reducing disk space usage - If the available disk space is low or if
you mind the data size variation, you can run the cleanup per *table*,
sequentially, one by one, instead of running it on the whole node or
keyspace. Cleanups go through compactions that start by increasing
the used disk space to write temporary SSTables. Most of the disk space is
freed at the end of the cleanup operation. Going one table at a time and
with a low number of threads helped me in the past to run cleanups in the
most extreme conditions.
Here is how this could be run (you may need to adapt this):

```
screen -R cleanup
# From screen (no quotes around the keyspace list, so the loop iterates over each one):
for ks in myks yourks whateverks; do
  # table directories are named <table>-<cfid>; keep only the table name
  tables=$(ls /var/lib/cassandra/data/$ks | sort | cut -d "-" -f 1)
  for table in $tables; do
    echo "Running nodetool cleanup on $ks.$table..."
    nodetool cleanup -j 2 $ks $table
  done
done
```

The screen is a good idea to answer the question 'Did the cleanup finish?'.
You get back to the screen and see if the command returned or not and you
don't have to kill the command just after running it.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 22 oct. 2018 à 21:18, Jeff Jirsa  a écrit :

> Nodetool will eventually return when it’s done
>
> You can also watch nodetool compactionstats
>
> --
> Jeff Jirsa
>
>
> > On Oct 22, 2018, at 10:53 AM, Ian Spence 
> wrote:
> >
> > Environment: Cassandra 2.2.9, GNU/Linux CentOS 6 + 7. Two DCs, 3 RACs in
> DC1 and 6 in DC2.
> >
> > We recently added 16 new nodes to our 38-node cluster (now 54 nodes).
> What would be the safest and most
> > efficient way of running a cleanup operation? I’ve experimented with
> running cleanup on a single node and
> > nodetool just hangs, but that seems to be a known issue.
> >
> > Would something like running it on a couple of nodes per day, working
> through the cluster, work?
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Upgrade to version 3

2018-10-18 Thread Alain RODRIGUEZ
Hello,

You might want to have a look at
https://issues.apache.org/jira/browse/CASSANDRA-14823

It seems that you could face a *data loss* while upgrading to Cassandra
3.11.3. Apparently, it is still somewhat unsafe to upgrade Cassandra to
C*3, even if you use the latest C*3.0.17/3.11.3. According to Blake who
reported and worked on the fix:

[...], which will lead to duplicate start bounds being emitted, and
> incorrect dropping of rows in some cases
>

It's a bug that was recently fixed and that should be soon released, I hope.

I imagine that a lot of people did this upgrade already, it might be just
fine for you as well. Yet you might want to explore this issue and maybe
consider waiting for this patch to be released to reduce the risks (or
applying the patch yourself).

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 18 oct. 2018 à 12:31, Anup Shirolkar 
a écrit :

> Hi,
>
> Yes you can upgrade from 2.2 to 3.11.3
>
> The steps for upgrade are there on lots of blogs and sites.
>
>  You can follow:
>
> https://myopsblog.wordpress.com/2017/12/04/upgrade-cassandra-cluster-from-2-x-to-3-x/
>
> You should read the NEWS.txt for information on any release while planning
> for upgrade.
> https://github.com/apache/cassandra/blob/trunk/NEWS.txt
>
> Please see below mail archive for your case of 2.2 to 3.x :
> https://www.mail-archive.com/user@cassandra.apache.org/msg45381.html
>
> Regards,
>
> Anup Shirolkar
>
>
>
>
> On Thu, 18 Oct 2018 at 09:30, Mun Dega  wrote:
>
>> Hello,
>>
>> If we are upgrading from version 2.2 to 3.x, should we go directly to
>> latest version 3.11.3?
>>
>> Anything we need to look out for?  If anyone can point to an upgrade
>> process that would be great!
>>
>>
>>


Re: Mview in cassandra

2018-10-17 Thread Alain RODRIGUEZ
Hello,

The error might be related to your specific clusters (sstableloader / 1
node); I imagine connectivity might be wrong or the data might not be loaded
properly for some reason. The data has to be available (nodes up, and
maybe a 'nodetool refresh' / Cassandra restart after the sstableloader run)
and the nodes able to connect to each other.

I tried to drop the Materialized View and recreate it , but the data is not
> getting populated with version 3.11.1
>

What error do you have then, when recreating (if any)?
Are normal reads working? And when using a consistency of 'all'?
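If it helps, a quick hedged check along those lines (keyspace, table and view names
below are placeholders):

```
cqlsh -e "CONSISTENCY ALL; SELECT * FROM my_ks.my_base_table LIMIT 5;"
cqlsh -e "CONSISTENCY ALL; SELECT * FROM my_ks.my_view LIMIT 5;"
```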

I was overall willing to share this information that might matter to you
about MVs:
The recommendation around Materialized Views at the moment is as follows:
"Do not use them unless you know exactly how they (don't) work."

http://mail-archives.apache.org/mod_mbox/cassandra-user/201710.mbox/%3cetpan.59f24f38.438f4e99.7...@apple.com%3E

According to Blake (but I think there is a large consensus on this opinion):

Concerns about MV’s suitability for production are not uncommon, and this
> just formalizes
> the advice often given to people considering materialized views. That is:
> materialized views
> have shortcomings that can make them unsuitable for the general use case.
> If you’re not
> familiar with their shortcomings and confident they won’t cause problems
> for your use case,
> you shouldn’t use them



The shortcomings I’m referring to are:
> * There's no way to determine if a view is out of sync with the base table.
> * If you do determine that a view is out of sync, the only way to fix it
> is to drop and rebuild
> the view.
> Even in the happy path, there isn’t an upper bound on how long it will
> take for updates
> to be reflected in the view.


It was even shared that the feature is complex and that we (community, but
I *think* also committers and developers) don't have a complete
understanding of this feature.
You might want to consider looking for a workaround not involving MVs.
Also, version C*3.11.2 does not seem to have any improvement regarding MVs
compared to C*3.11.1; it was just marked as experimental, it seems. Thus
upgrading will probably not help.

I imagine it's not what you wanted to hear, but I would really not stick
with MVs at the moment. If this cluster were under my responsibility, I
would probably consider redesigning the schema.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le mar. 16 oct. 2018 à 18:25, rajasekhar kommineni  a
écrit :

> Hi,
>
> I am seeing below warning message in system.log after datacopy using
> sstabloader.
>
> WARN  [CompactionExecutor:972] 2018-10-15 22:20:39,308
> ViewBuilder.java:189 - Materialized View failed to complete, sleeping 5
> minutes before restarting
> org.apache.cassandra.exceptions.UnavailableException: Cannot achieve
> consistency level ONE
>
> I tried to drop the Materialized View and recreate it , but the data is
> not getting populated with version 3.11.1
>
> I tried the same on version 3.11.2 on single node dev box and I can query
> the Materialized View with data. Any body have some experiences with
> Mview’s.
>
> Thanks,
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Connections info

2018-10-05 Thread Alain RODRIGUEZ
Hello Abdul,

I was caught by a different topic while answering; sending the message
over anyway, even though it's similar to Romain's solution.

There is the metric mentioned above, or to have more details such as the
app node IP, you can use:

$ sudo netstat -tupawn | grep 9042 | grep ESTABLISHED

tcp0  0 ::::*9042*   :::*
LISTEN  -

tcp0  0 ::::*9042*
::::51486  ESTABLISHED
-

tcp0  0 ::::*9042*
::::37624  ESTABLISHED
-
[...]

tcp0  0 ::::*9042*
::::49108  ESTABLISHED
-

or to count them:

$ sudo netstat -tupawn | grep 9042 | grep ESTABLISHED | wc -l

113

I'm not sure about the '-tupawn' options, it gives me the format I need and
I never wondered much about it I must say. Maybe some of the options are
useless.

Sending this command through ssh would allow you to gather the information
in one place. You can also run similar commands on the clients (Apps) too. I
hope that helps.

C*heers
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 5 oct. 2018 à 06:28, Max C.  a écrit :

> Looks like the number of connections is available in JMX as:
>
> org.apache.cassandra.metrics:type=Client,name=connectedNativeClients
>
> http://cassandra.apache.org/doc/4.0/operating/metrics.html
>
> "Number of clients connected to this nodes native protocol server”
>
> As for where they’re coming from — I’m not sure how to get that from JMX.
> Maybe you’ll have to use “lsof” or something to get that.
>
> - Max
>
> On Oct 4, 2018, at 8:57 pm, Abdul Patel  wrote:
>
> Hi All,
>
> Can we get number of users connected to each node in cassandra?
> Also can we get from whixh app node they are connecting from?
>
>
>


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-05 Thread Alain RODRIGUEZ
I feel you for most of the troubles you faced, I've been facing most of
them too. Again, Datadog support can probably help you with most of those.
You should really consider sharing this feedback with them.

there is re-namespacing of the metric names in lots of cases, and these
> don't appear to be centrally documented, but maybe i haven't found the
> magic page.
>

I don't know if that would be the 'magic' page, but that's something:
https://github.com/DataDog/integrations-core/blob/master/cassandra/metadata.csv

There are so many good stats.


Yes, and it's still improving. I love this about Cassandra. It's our work
to pick the relevant ones for each situation. I would not like Cassandra to
reduce the number of metrics exposed, we need to learn to handle them
properly. Also, this is the reason we designed 4 dashboards out the box,
the goal was to have everything we need for distinct scenarios:
- Overview - global health-check / anomaly detection
- Read Path - troubleshooting / optimizing read ops
- Write Path - troubleshooting / optimizing write ops
- SSTable Management - troubleshooting / optimizing -
compaction/flushes/... anything related to sstables.

instead of the single overview dashboard that was present before. We are
also perfectly aware that it's far from perfect, but aiming at perfect
would only have had us never releasing anything. Anyone interested could
now build missing dashboards or improve existing ones for himself or/and
suggest improvements to Datadog :). I hope I'll do some more of this work
at some point in the future.

Good luck,
C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 4 oct. 2018 à 21:21, Carl Mueller
 a écrit :

> for 2.1.x we had a custom reporter that delivered  metrics to datadog's
> endpoint via https, bypassing the agent-imposed 350. But integrating that
> required targetting the other shared libs in the cassandra path, so the
> build is a bit of a pain when we update major versions.
>
> We are migrating our 2.1.x specific dashboards, and we will use
> agent-delivered metrics for non-table, and adapt the custom library to
> deliver the table-based ones, at a slower rate than the "core" ones.
>
> Datadog is also super annoying because there doesn't appear to be anything
> that reports what metrics the agent is sending (the metric count can
> indicate if a configured new metric increased the count and is being
> reported, but it's still... a guess), and there is re-namespacing of the
> metric names in lots of cases, and these don't appear to be centrally
> documented, but maybe i haven't found the magic page.
>
> There are so many good stats. We might also implement some facility to
> dynamically turn on the delivery of detailed metrics on the nodes.
>
> On Tue, Oct 2, 2018 at 5:21 AM Alain RODRIGUEZ  wrote:
>
>> Hello Carl,
>>
>> I guess we can use bean_regex to do specific targetted metrics for the
>>> important tables anyway.
>>>
>>
>> Yes, this would work, but 350 is very limited for Cassandra dashboards.
>> We have a LOT of metrics available.
>>
>> Datadog 350 metric limit is a PITA for tables once you get over 10 tables
>>>
>>
>> I noticed this while I was working on providing default dashboards for
>> Cassandra-Datadog integration. I was told by Datadog team it would not be
>> an issue for users, that I should not care about it. As you pointed out,
>> per table metrics quickly increase the total number of metrics we need to
>> collect.
>>
>> I believe you can set the following option: *"max_returned_metrics:
>> 1000"* - it can be used if metrics are missing to increase the limit of
>> the number of collected metrics. Be aware of CPU utilization that this
>> might imply (greatly improved in dd-agent version 6+ I believe -thanks
>> Datadog teams for that- making this fully usable for Cassandra). This
>> option should go in the *cassandra.yaml* file for Cassandra
>> integrations, off the top of my head.
>>
>> Also, do not hesitate to reach to Datadog directly for this kind of
>> questions, I have always been very happy with their support so far, I am
>> sure they would guide you through this as well, probably better than we can
>> do :). It also provides them with feedback on what people are struggling
>> with I imagine.
>>
>> I am interested to know if you still have issues getting more metrics
>> (option above not working / CPU under too much load) as this would make the
>> dashboards we built mostly unusable for clusters with more tables. We might
>> then need to review the design.
>>

Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-02 Thread Alain RODRIGUEZ
Hello Carl,

I guess we can use bean_regex to do specific targetted metrics for the
> important tables anyway.
>

Yes, this would work, but 350 is very limited for Cassandra dashboards. We
have a LOT of metrics available.

Datadog 350 metric limit is a PITA for tables once you get over 10 tables
>

I noticed this while I was working on providing default dashboards for the
Cassandra-Datadog integration. I was told by the Datadog team that it would
not be an issue for users and that I should not worry about it. As you
pointed out, per-table metrics quickly increase the total number of metrics
we need to collect.

I believe you can set the following option: *"max_returned_metrics: 1000"* -
it can be used, if metrics are missing, to raise the limit on the number of
collected metrics. Be aware of the extra CPU utilization this might imply
(greatly improved in dd-agent version 6+ I believe - thanks Datadog teams
for that - making this fully usable for Cassandra). This option should go in
the *cassandra.yaml* file of the Cassandra integration, off the top of my
head.
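
For illustration, a hedged sketch of where that option could live in the
agent's Cassandra/JMX check configuration - the file path, nesting and values
below are assumptions and may differ between agent versions, so check the
Datadog documentation for yours:

```
# e.g. conf.d/cassandra.yaml (dd-agent 5) or conf.d/cassandra.d/conf.yaml (agent 6+)
instances:
  - host: localhost
    port: 7199                  # JMX port of the Cassandra node
    max_returned_metrics: 1000  # raise the default 350-metric cap
```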

Also, do not hesitate to reach out to Datadog directly for this kind of
question. I have always been very happy with their support so far, and I am
sure they would guide you through this as well, probably better than we can
do :). It also provides them with feedback on what people are struggling
with, I imagine.

I am interested to know if you still have issues getting more metrics
(option above not working / CPU under too much load) as this would make the
dashboards we built mostly unusable for clusters with more tables. We might
then need to review the design.

As a side note, I believe metrics are handled the same way across versions:
they get the same name/label for C*2.1, 2.2 and 3+ on Datadog. There is an
abstraction layer that removes this complexity (if I remember correctly, we
built those dashboards a while ago).

C*heers
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 1 oct. 2018 à 19:38, Carl Mueller
 a écrit :

> That's great too, thank you.
>
> Datadog 350 metric limit is a PITA for tables once you get over 10 tables,
> but I guess we can use bean_regex to do specific targetted metrics for the
> important tables anyway.
>
> On Mon, Oct 1, 2018 at 4:21 AM Alain RODRIGUEZ  wrote:
>
>> Hello Carl,
>>
>> Here is a message I sent to my team a few months ago. I hope this will be
>> helpful to you and more people around :). It might not be exhaustive and we
>> were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but C*2.2
>> is similar to C*3.0 if I remember correctly in terms of metrics. Here it is
>> for what it's worth:
>>
>> Quite a few things changed between metric reporter in C* 2.1 and C*3.0.
>> - ColumnFamily --> Table
>> - XXpercentile --> pXX
>> - 1MinuteRate -->  m1_rate
>> - metric name before KS and Table names and some other changes of this
>> kind.
>> - ^ aggregations / aliases indexes changed because of this (using
>> graphite for example) ^
>> - ‘.value’ is not appended in the metric name anymore for gauges, nothing
>> instead.
>>
>> For example (graphite):
>>
>> From
>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
>> 2, 3), 1, 7, 8, 9)
>>
>> to
>> aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
>> 2, 3), 1, 8, 9, 10)
>>
>> C*heers,
>> ---
>> Alain Rodriguez - @arodream - al...@thelastpickle.com
>> France / Spain
>>
>> The Last Pickle - Apache Cassandra Consulting
>> http://www.thelastpickle.com
>>
>> Le ven. 28 sept. 2018 à 20:38, Carl Mueller
>>  a écrit :
>>
>>> VERY NICE! Thank you very much
>>>
>>> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
>>> lyuben.todo...@instaclustr.com> wrote:
>>>
>>>> Nothing as fancy as a matrix but a list of what JMX term can see.
>>>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>>>
>>>> /lyubent
>>>>
>>>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>>>  wrote:
>>>>
>>>>> It's my understanding that metrics got heavily re-namespaced in JMX
>>>>> for 2.2 from 2.1
>>>>>
>>>>> Did anyone ever make a migration matrix/guide for conversion of old
>>>>> metrics to new metrics?
>>>>>
>>>>>
>>>>>


Re: Re: Re: how to configure the Token Allocation Algorithm

2018-10-01 Thread Alain RODRIGUEZ
Hello again :),

I thought a little bit more about this question, and I was actually
wondering if something like this would work:

Imagine a 3-node cluster, and create the nodes using:
For the 3 nodes: `num_tokens: 4`
Node 1: `initial_token: -9223372036854775808, -4611686018427387905, -2,
4611686018427387901`
Node 2: `initial_token: -7686143364045646507, -3074457345618258604,
1537228672809129299, 6148914691236517202`
Node 3: `initial_token: -6148914691236517206, -1537228672809129303,
3074457345618258600, 7686143364045646503`

If you know the initial size of your cluster, you can calculate the total
number of tokens (number of nodes * vnodes) and use the formula/python code
above to get the tokens. Then use the first token for the first node, move
to the second node, use the second token, and repeat. In my case there is a
total of 12 tokens (3 nodes, 4 tokens each):
```
>>> number_of_tokens = 12
>>> [str(((2**64 // number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)]
['-9223372036854775808', '-7686143364045646507', '-6148914691236517206',
'-4611686018427387905', '-3074457345618258604', '-1537228672809129303',
'-2', '1537228672809129299', '3074457345618258600', '4611686018427387901',
'6148914691236517202', '7686143364045646503']
```

It actually works nicely, apparently. Here is a quick ccm test I ran
with the configuration above:

```

$ ccm node1 nodetool status tlp_lab
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  82.47 KiB  4       66.7%             1ed8680b-7250-4088-988b-e4679514322f  rack1
UN  127.0.0.2  99.03 KiB  4       66.7%             ab3655b5-c380-496d-b250-51b53efb4c00  rack1
UN  127.0.0.3  82.36 KiB  4       66.7%             ad2b343e-5f6e-4b0d-b79f-a3dfc3ba3c79  rack1
```

Ownership is perfectly distributed, like it would be without vnodes. Tested
with C* 3.11.1 and CCM.

I followed the procedure we were talking about for my second test, after
wiping out the data in my 3-node ccm cluster:
RF=2 for tlp_lab, the first node with initial_token defined and the other
nodes using 'allocate_tokens_for_keyspace: tlp_lab':

$ ccm node1 nodetool status tlp_lab
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens  Owns (effective)  Host ID                               Rack
UN  127.0.0.1  86.71 KiB  4       96.2%             6e4c0ce0-2e2e-48ff-b7e0-3653e76366a3  rack1
UN  127.0.0.2  65.63 KiB  4       54.2%             592cda85-5807-4e7a-aa3b-0d9ae54cfaf3  rack1
UN  127.0.0.3  99.04 KiB  4       49.7%             f2c4eccc-31cc-458c-a599-5373c1169d3c  rack1

This is not as good. I guess a fourth node would help, but it would still
not be perfect.

I would still check what happens when you add a few more nodes with
'allocate_tokens_for_keyspace' afterward and without 'initial_token', to
avoid any surprises.
I have not seen anyone using this yet. Please take it as an idea to dig
into, not as a recommendation :).

I also noticed I did not answer the second part of the mail:

My cluster Size won't go beyond 150 nodes, should i still use The
> Allocation Algorithm instead of random with 256 tokens (performance wise or
> load-balance wise)?
>

I would say yes. There is talk of changing this default (256 vnodes), which
is now probably always a bad idea since 'allocate_tokens_for_keyspace' was
added.

Is the Allocation Algorithm, widely used and tested with Community and can
> we migrate all clusters with any size to use this Algorithm Safely?
>

Here again, I would say yes. I am not sure that it is widely used yet, but
I think so. Also, you can always check the ownership with 'nodetool status
' after adding the nodes, and before adding data or traffic to
this data center, so there is probably no real risk if you check ownership
distribution after adding nodes. If you don't like the distribution, you
can decommission the nodes, clean them, and try again; I used to call it
'rolling the dice' when I was still using the random algorithm :). I mean,
once the token range ownership is distributed to the nodes, nothing changes
afterwards. We don't need this 'algorithm' after the bootstrap, I would say.


> Out of Curiosity, i wonder how people (i.e, in Apple) config and maintain
> token management of clusters with thousands of nodes?
>

I am not sure about Apple, but my understanding is that some of those
companies don't use vnodes and have a 'ring management tool' to perform the
necessary 'nodetool move' operations around the cluster relatively easily or
automatically. Others probably use a low number of vnodes (something
between 4 and 32) and 'allocate_tokens_for_keyspace'.

Also, my understanding is that it's very rare to have clusters with
thousands of nodes. At that scale you can start having issues around gossip,
if I remember correctly what I read/discussed. I would probably add a second
cluster 

Re: unsubscribe

2018-10-01 Thread Alain RODRIGUEZ
Hello,

You're still subscribed to this mailing list I am afraid :). In case you
missed it:

To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org


C*heers.

Le lun. 1 oct. 2018 à 08:04, Gabriel Lindeborg <
gabriel.lindeb...@svenskaspel.se> a écrit :

>
> AB SVENSKA SPEL
> 621 80 Visby
> Norra Hansegatan 17, Visby
> Växel: +4610-120 00 00
> https://svenskaspel.se
>
> Please consider the environment before printing this email
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: how to configure the Token Allocation Algorithm

2018-10-01 Thread Alain RODRIGUEZ
Hello,

Your process looks good to me :). Still a couple of comments to make it
more efficient (hopefully).

*- Improving step 2:*

I believe you can actually get a slightly better distribution by picking the
tokens for the (first) seed node yourself. This prevents the node from
randomly calculating its token ranges. You can calculate the tokens using
the following python code:

$ python  # Start the python shell
[...]
>>> number_of_tokens = 8
>>> [str(((2**64 // number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)]
['-9223372036854775808', '-6917529027641081856',
'-4611686018427387904', '-2305843009213693952', '0',
'2305843009213693952', '4611686018427387904', '6917529027641081856']


Set 'initial_token' to the above list (comma-separated) and the
number of vnodes to 'num_tokens: 8'.
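
For instance, a minimal sketch of what that could look like in the first seed
node's cassandra.yaml, reusing the values printed above (adjust to your own
output):

```
# cassandra.yaml of the first seed node only
num_tokens: 8
initial_token: -9223372036854775808,-6917529027641081856,-4611686018427387904,-2305843009213693952,0,2305843009213693952,4611686018427387904,6917529027641081856
```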

This technique proved to be way more efficient (especially for low token
numbers / small number of nodes). Luckily it's also easy to test.

- *Step 4 might not be needed*

I don't see the need to stop/start the seed. The option
'allocate_tokens_for_keyspace'
won't affect this seed node (already initialized) in any way.

Also, do not forget to promote more nodes to 'seeds', either after they
bootstrap or, for example, by simply starting a couple more seeds after the
first one.

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



Le lun. 1 oct. 2018 à 07:16, onmstester onmstester  a
écrit :

> Since i failed to find a document on how to configure and use the Token
> Allocation Algorithm (to replace the random Algorithm), just wanted to be
> sure about the procedure i've done:
> 1. Using Apache Cassandra 3.11.2
> 2. Configured one of seed nodes with num_tokens=8 and started it.
> 3. Using Cqlsh created keyspace test with NetworkTopologyStrategy and RF=3.
> 4. Stopped the seed node.
> 5. add this line to cassandra.yaml of all nodes (all have num_tokens=8)
> and started the cluster:
> allocate_tokens_for_keyspace=test
>
> My cluster Size won't go beyond 150 nodes, should i still use The
> Allocation Algorithm instead of random with 256 tokens (performance wise or
> load-balance wise)?
> Is the Allocation Algorithm, widely used and tested with Community and can
> we migrate all clusters with any size to use this Algorithm Safely?
> Out of Curiosity, i wonder how people (i.e, in Apple) config and maintain
> token management of clusters with thousands of nodes?
>
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>


Re: Metrics matrix: migrate 2.1.x metrics to 2.2.x+

2018-10-01 Thread Alain RODRIGUEZ
Hello Carl,

Here is a message I sent to my team a few months ago. I hope this will be
helpful to you and more people around :). It might not be exhaustive and we
were moving from C*2.1 to C*3+ in this case, thus skipping C*2.2, but C*2.2
is similar to C*3.0 if I remember correctly in terms of metrics. Here it is
for what it's worth:

Quite a few things changed between the metric reporters in C* 2.1 and C* 3.0:
- ColumnFamily --> Table
- XXpercentile --> pXX
- 1MinuteRate --> m1_rate
- the metric name now comes before the keyspace and table names, plus some
other changes of this kind.
- ^ aggregation / alias indexes changed because of this (using graphite
for example) ^
- ‘.value’ is no longer appended to the metric name for gauges; nothing is
appended instead.

For example (graphite):

From
aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.ColumnFamily.$ks.$table.ReadLatency.95percentile,
2, 3), 1, 7, 8, 9)

to
aliasByNode(averageSeriesWithWildcards(cassandra.$env.$dc.$host.org.apache.cassandra.metrics.Table.ReadLatency.$ks.$table.p95,
2, 3), 1, 8, 9, 10)

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le ven. 28 sept. 2018 à 20:38, Carl Mueller
 a écrit :

> VERY NICE! Thank you very much
>
> On Fri, Sep 28, 2018 at 1:32 PM Lyuben Todorov <
> lyuben.todo...@instaclustr.com> wrote:
>
>> Nothing as fancy as a matrix but a list of what JMX term can see.
>> Link to the online diff here: https://www.diffchecker.com/G9FE9swS
>>
>> /lyubent
>>
>> On Fri, 28 Sep 2018 at 19:04, Carl Mueller
>>  wrote:
>>
>>> It's my understanding that metrics got heavily re-namespaced in JMX for
>>> 2.2 from 2.1
>>>
>>> Did anyone ever make a migration matrix/guide for conversion of old
>>> metrics to new metrics?
>>>
>>>
>>>


Re: TTL tombstones in Cassandra using LCS are created in the same level as the TTLed data?

2018-09-27 Thread Alain RODRIGUEZ
Hello Gabriel,

Another clue to explore would be to use the TTL as a default value if
> that's a good fit. TTLs set at the table level with 'default_time_to_live'
> should not generate any tombstone at all in C*3.0+. Not tested on my hand,
> but I read about this.
>

As explained on a parallel thread, this is wrong ^, mea culpa. I believe
the rest of my comment still stands (hopefully :)).

I'm not sure what it means with "*in-place*" since SSTables are immutable.
> [...]

 My guess is that is referring to tombstones being created in the same
> level (but different SStables) that the TTLed data during a compaction
> triggered


Yes, I believe that during the next compaction following the expiration date,
the entry is 'transformed' into a tombstone and lives in the SSTable that
results from the compaction, in the level/bucket this SSTable is put
into. That's why I said 'in-place', which is indeed a bit weird for
immutable data.

As a side idea for your problem, on 'modern' versions of Cassandra (I don't
remember the exact version, that's what 'modern' means ;-)), you can run
'nodetool garbagecollect' regularly (not necessarily frequently) during
off-peak periods. That way it uses cluster resources when you don't need
them in order to reclaim some disk space. Also, making sure by design that a
2-year-old record is not being updated regularly would definitely help. In
the extreme case of writing data once (never updated) and with a TTL, for
example, I see no reason for 2-year-old data not to be evicted
correctly. As long as the disk can grow, it should be fine.
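
For what it's worth, a minimal sketch of that idea (keyspace and table names
are placeholders; check that the command is available on your exact version
first):

```
# run during off-peak hours; rewrites SSTables to drop data shadowed by tombstones/expired TTLs
nodetool garbagecollect my_keyspace my_table
```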

I would not be too scared about it, as there is 'always' a way to
remove tombstones. Yet it's indeed good to think about the design beforehand;
generally, it's good if you can rotate the partitions over time and avoid
reusing old partitions, for example.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 25 sept. 2018 à 17:38, Gabriel Giussi  a
écrit :

> I'm using LCS and a relatively large TTL of 2 years for all inserted rows
> and I'm concerned about the moment at wich C* would drop the corresponding
> tombstones (neither explicit deletes nor updates are being performed).
>
> From [Missing Manual for Leveled Compaction Strategy](
> https://www.youtube.com/watch?v=-5sNVvL8RwI), [Tombstone Compactions in
> Cassandra](https://www.youtube.com/watch?v=pher-9jqqC4) and [Deletes
> Without Tombstones or TTLs](https://www.youtube.com/watch?v=BhGkSnBZgJA)
> I understand that
>
>  - All levels except L0 contain non-overlapping SSTables, but a partition
> key may be present in one SSTable in each level (aka distributed in all
> levels).
>  - For a compaction to be able to drop a tombstone it must be sure that it
> is compacting all SSTables that contain the data, to prevent zombie data
> (this is done by checking bloom filters). It also considers gc_grace_seconds
>
> So, for my particular use case (2-year TTL and write-heavy load) I can
> conclude that TTLed data will be in the highest levels, so I'm wondering
> when those SSTables with TTLed data will be compacted with the SSTables
> that contain the corresponding tombstones.
> The main question is: **Where are tombstones (from TTLs) being created?
> Are they being created at Level 0, so it will take a long time until they
> end up in the highest levels (hence disk space will take a long time to
> be freed)?**
>
> In a comment from [About deletes and tombstones](
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html)
> Alain says that
> > Yet using TTLs helps, it reduces the chances of having data being
> fragmented between SSTables that will not be compacted together any time
> soon. Using any compaction strategy, if the delete comes relatively late in
> the row history, as it use to happen, the 'upsert'/'insert' of the
> tombstone will go to a new SSTable. It might take time for this tombstone
> to get to the right compaction "bucket" (with the rest of the row) and for
> Cassandra to be able to finally free space.
> **My understanding is that with TTLs the tombstones is created in-place**,
> thus it is often and for many reasons easier and safer to get rid of a TTLs
> than from a delete.
> Another clue to explore would be to use the TTL as a default value if
> that's a good fit. TTLs set at the table level with 'default_time_to_live'
> should not generate any tombstone at all in C*3.0+. Not tested on my hand,
> but I read about this.
>
> I'm not sure what it means with "*in-place*" since SSTables are
> immutable.
> (I also have some doubts about what it says of using
> `default_time_to_live` that I've asked in [How default_time_to_live would
> delete rows without tombstone

Re: How default_time_to_live would delete rows without tombstones in Cassandra?

2018-09-27 Thread Alain RODRIGUEZ
Hello Gabriel,

Sorry for not answering earlier. I should have, given that I contributed to
spreading this wrong idea. I will also try to edit my comment in the post.
I was fooled by the piece of documentation you mentioned when
answering this question on our blog. I probably answered this one too
quickly, even though I presented it as a thing 'to explore' and explicitly
said I had not tried it.

Another clue to explore would be to use the TTL as a default value if
> that's a good fit. TTLs set at the table level with
> 'default_time_to_live' **should not generate any tombstone at all in
> C*3.0+**. Not tested on my hand, but I read about this.


So my sentence above is wrong. Basically, the default can be overridden by
the TTL at the query level, and I do not see how Cassandra could handle this
without tombstones.
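
As a quick illustration of why (a minimal sketch with made-up keyspace/table
names, not from the original exchange):

```
-- the table-level default applies when no TTL is given on the write...
CREATE TABLE demo.test_ttl (k text PRIMARY KEY, v text)
  WITH default_time_to_live = 180;

INSERT INTO demo.test_ttl (k, v) VALUES ('a', '1');                 -- expires after 180 s (table default)
INSERT INTO demo.test_ttl (k, v) VALUES ('b', '2') USING TTL 3600;  -- ...but a per-query TTL overrides it
```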

I spent time on the post and it was reviewed. I believe it is reliable. The
questions, on the other hand, are answered by me alone and, well, only
reflect my opinion at the moment I am asked; I sometimes find enough
time and interest to dig into topics, sometimes a bit less. So this is fully
on me, my apologies for this inaccuracy. I must say I am always afraid, when
writing publicly and sharing information, of making this kind of mistake and
misleading people. I hope the impact of this read was still positive for you
overall.

> From the example I conclude that it isn't true that `default_time_to_live`
> does not require tombstones, at least for version 3.0.13.
>

Also, I am glad to see you did not believe me or the DataStax documentation
but tried it yourself. This is definitely the right approach.

But how would C* delete without tombstones? Why this should be a different
> scenario to using TTL per insert?
>

Yes, exactly this,

C*heers.
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 17 sept. 2018 à 14:58, Gabriel Giussi  a
écrit :

>
> From
> https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlAboutDeletes.html
>
> > Cassandra allows you to set a default_time_to_live property for an
> entire table. Columns and rows marked with regular TTLs are processed as
> described above; but when a record exceeds the table-level TTL, **Cassandra
> deletes it immediately, without tombstoning or compaction**.
>
> This is also answered in https://stackoverflow.com/a/50060436/3517383
>
> >  If a table has default_time_to_live on it then rows that exceed this
> time limit are **deleted immediately without tombstones being written**.
>
> And commented in LastPickle's post About deletes and tombstones (
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html#comment-3949581514
> )
>
> > Another clue to explore would be to use the TTL as a default value if
> that's a good fit. TTLs set at the table level with 'default_time_to_live'
> **should not generate any tombstone at all in C*3.0+**. Not tested on my
> hand, but I read about this.
>
> I've made the simplest test that I could imagine using
> `LeveledCompactionStrategy`:
>
> CREATE KEYSPACE IF NOT EXISTS temp WITH replication = {'class':
> 'SimpleStrategy', 'replication_factor': '1'};
>
> CREATE TABLE IF NOT EXISTS temp.test_ttl (
> key text,
> value text,
> PRIMARY KEY (key)
> ) WITH  compaction = { 'class': 'LeveledCompactionStrategy'}
>   AND default_time_to_live = 180;
>
>  1. `INSERT INTO temp.test_ttl (key,value) VALUES ('k1','v1');`
>  2. `nodetool flush temp`
>  3. `sstabledump mc-1-big-Data.db`
> [image: cassandra0.png]
>
>  4. wait for 180 seconds (default_time_to_live)
>  5. `sstabledump mc-1-big-Data.db`
> [image: cassandra1.png]
>
> The tombstone isn't created yet
>  6. `nodetool compact temp`
>  7. `sstabledump mc-2-big-Data.db`
> [image: cassandra2.png]
>
> The **tombstone is created** (and not dropped on compaction due to
> gc_grace_seconds)
>
> The test was performed using apache cassandra 3.0.13
>
> From the example I conclude that it isn't true that `default_time_to_live`
> does not require tombstones, at least for version 3.0.13.
> However this is a very simple test and I'm forcing a major compaction with
> `nodetool compact` so I may not be recreating the scenario where
> default_time_to_live magic comes into play.
>
> But how would C* delete without tombstones? Why this should be a different
> scenario to using TTL per insert?
>
> This is also asked at stackoverflow
> <https://stackoverflow.com/questions/52282517/how-default-time-to-live-would-delete-rows-without-tombstones-in-cassandra>
>


Re: Error during truncate: Cannot achieve consistency level ALL , how to fix it

2018-09-27 Thread Alain RODRIGUEZ
Hello Shyam,

I think Jonathan understood the meaning of 'RF'. He is suggesting that you
look at the replication strategy/RF of all your keyspaces - the system_auth
keyspace in particular - and make sure you use the right replication factor
(and probably NetworkTopologyStrategy).

This might help: cqlsh -e "DESCRIBE KEYSPACE system_auth;"
Or to check them all: cqlsh -e "DESCRIBE KEYSPACES;"

I don't know what's wrong exactly, but your application is truncating with
a consistency level of 'ALL', meaning all the replicas must be up for your
application to work.
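
If the replication settings do turn out to be the issue, a hedged sketch of
the kind of change Jonathan suggests (the datacenter name and RF below are
placeholders - match your own topology and the user keyspace RF, then repair
system_auth):

```
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 2};
```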

Le mer. 19 sept. 2018 à 14:30, sha p  a écrit :

> RF is replication factor. Sorry for confusing
>
> On 19 Sep 2018 5:45 p.m., "Jonathan Baynes" 
> wrote:
>
> What RF is your system_auth keyspace?
>
>
>
> If its one, match it to the user keyspace, and restart the node.
>
>
>
> *From:* sha p [mailto:shatestt...@gmail.com]
> *Sent:* 19 September 2018 11:49
> *To:* user@cassandra.apache.org
> *Subject:* Error during truncate: Cannot achieve consistency level ALL ,
> how to fix it
>
>
>
> Hi All,
>
>  I am new to Cassandra. Following below link
>
>
>
>
> https://grokonez.com/spring-framework/spring-data/start-spring-data-cassandra-springboot#III_Sourcecode
> 
>
>
>
>
>
> I have three node cluster , keyspace set with RF = 2 , but when I run this
> application from above source code bellow error is thrown
>
>  Caused by: com.datastax.driver.core.exceptions.TruncateException: Error
> during truncate: Cannot achieve consistency level ALL """ 
>
>
>
>
>
> What wrong i am doing here ..How to fix it ? Plz help me.
>
>
>
> Regards,
>
> Shyam
>
> 
>
> This e-mail may contain confidential and/or privileged information. If you
> are not the intended recipient (or have received this e-mail in error)
> please notify the sender immediately and destroy it. Any unauthorized
> copying, disclosure or distribution of the material in this e-mail is
> strictly forbidden. Tradeweb reserves the right to monitor all e-mail
> communications through its networks. If you do not wish to receive
> marketing emails about our products / services, please let us know by
> contacting us, either by email at contac...@tradeweb.com or by writing to
> us at the registered office of Tradeweb in the UK, which is: Tradeweb
> Europe Limited (company number 3912826), 1 Fore Street Avenue London EC2Y
> 9DT. To see our privacy policy, visit our website @ www.tradeweb.com.
>
>
>


Re: Adding datacenter and data verification

2018-09-17 Thread Alain RODRIGUEZ
Hello Pradeep,

It looks good to me and it's a cool runbook for you to follow and for
others to reuse.

To make sure that cassandra nodes in one datacenter can see the nodes of
> the other datacenter, add the seed node of the new datacenter in any of the
> old datacenter’s nodes and restart that node.


Nodes seeing each other across datacenters is not related to seeds.
It's indeed recommended to use seeds from all the datacenters (a couple or
3 each). I guess it's to increase the availability of seed nodes and/or to
make sure local seeds are available.

You can perfectly (and even have to) add your second datacenter nodes using
seeds from the first data center. A bootstrapping node should never be in
the list of seeds unless it's the first node of the cluster. Add nodes,
then make them seeds.
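
As an illustration only, a sketch of what the shared seed list could look
like on every node of both DCs (the IP addresses are made up; pick 2-3 nodes
per DC):

```
# cassandra.yaml - same list on all nodes, in both datacenters
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.1.10,10.0.1.11,10.0.2.10,10.0.2.11"
```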


Le lun. 17 sept. 2018 à 11:25, Pradeep Chhetri  a
écrit :

> Hello everyone,
>
> Can someone please help me in validating the steps i am following to
> migrate cassandra snitch.
>
> Regards,
> Pradeep
>
> On Wed, Sep 12, 2018 at 1:38 PM, Pradeep Chhetri 
> wrote:
>
>> Hello
>>
>> I am running cassandra 3.11.3 5-node cluster on AWS with SimpleSnitch. I
>> was testing the process to migrate to GPFS using AWS region as the
>> datacenter name and AWS zone as the rack name in my preprod environment and
>> was able to achieve it.
>>
>> But before decommissioning the older datacenter, I want to verify that
>> the data in newer dc is in consistence with the one in older dc. Is there
>> any easy way to do that.
>>
>> Do you suggest running a full repair before decommissioning the nodes of
>> older datacenter ?
>>
>> I am using the steps documented here: https://medium.com/p/465e9bf28d99
>> I will be very happy if someone can confirm me that i am doing the right
>> steps.
>>
>> Regards,
>> Pradeep
>>
>>
>>
>>
>>
>


Re: Default Single DataCenter -> Multi DataCenter

2018-09-14 Thread Alain RODRIGUEZ
Hello,


> Q. Isn't it a problem that at this point, DCAwareRobinPolicy and
> RoundRobinPolicy coexist?


Why not change it on all clients at once? But no, it should not be a
problem as long as you have only 1 DC.

 Q1. Must I add the new seed nodes in five all existing nodes?
>

I generally pick 3 per DC, which also become the contact points for the
clients to connect to and get information about the whole cluster. Then I
only need to pay attention to these nodes when removing them (it induces a
change in the client application configuration and in the seeds for all the
Cassandra nodes). Changing it and restarting should be good.

Q2. Don't I need to update the seed node settings of the new cluster
> (mydc2)?
>

All nodes should use the same seed nodes. In this case, use 2 or 3 nodes
from each data center as seeds on all the nodes (no matter the data center,
each node references all the seed nodes).

The procedure looks about right; there are probably some details we are
missing while reading, but you will spot them while running the test.
C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 11 sept. 2018 à 11:04, Eunsu Kim  a écrit :

> It’s self respond.
>
> Step3 is wrong.
>
> Even though it was SimpleSnitch, after changing the dc information
> CassandraDaemon will not start, and it logs the error below.
>
> ERROR [main] 2018-09-11 18:36:30,272 CassandraDaemon.java:708 - Cannot
> start node if snitch's data center (pg1) differs from previous data center
> (datacenter1). Please fix the snitch configuration, decommission and
> rebootstrap this node or use the flag -Dcassandra.ignore_dc=true.
>
>
> On 11 Sep 2018, at 2:25 PM, Eunsu Kim  wrote:
>
> Hello
>
> Thank you for your responses.
>
> I’ll share my adding datacenter plan. If you see problems, please respond.
>
> The sentence may be a little awkward because I am so poor at English that
> I am being helped by a translator.
>
> I've been most frequently referred to.(https://medium.com/p/465e9bf28d99) 
> Thank
> you for your cleanliness. (Pradeep Chhetri)
>
> I will also upgrade the version as Alain Rodriguez's advice.
>
> 
>
> Step 1. Stop all existing clusters. (My service is paused.)
>
> Step 2. Install Cassandra 3.11.3 and copy existing conf files.
>
> Step 3. Modify cassandra-rackdc.properties for existing nodes. dc=mydc1
> rack=myrack1
>  Q. I think this modification will not affect the existing data
> because it was SimpleSnitch before, right?
>
> Step 4. In the caseandra.yaml of existing nodes, endpoint_snitch is
> changed to GossippingPropertyFileSnitch.
>
> Step 5. Restart the Cassandra of the existing nodes. (My service is
> resumed.)
>
> Step 6. Change the settings of all existing clients to DCAwareRobinPolicy
> and refer to mydc1. Consistency level is LOCAL_ONE. And rolling restart.
>   Q. Isn't it a problem that at this point, DCAwareRobinPolicy and
> RoundRobinPolicy coexist?
>
> Step 7. Alter my keyspace and system keyspace(system_distributed,
> system_traces) :  SimpleStrategy(RF=2) -> { 'class' :
> 'NetworkTopologyStrategy', ‘mydc1’ : 2 }
>
> Step 8. Install Cassandra in a new cluster, copying existing conf files,
> and setting it to Cassandra-racdc.properties. dc=mydc2 rack=myrack2
>
> Step 9. Adding a new seed node to the cassandra.yaml of the existing
> cluster (mydc1) and restart.
>   Q1. Must I add the new seed nodes in five all existing nodes?
>   Q2. Don't I need to update the seed node settings of the new
> cluster (mydc2)?
>
> Step 10. Alter my keyspace and system keyspace(system_distributed,
> system_traces) :  { 'class' : 'NetworkTopologyStrategy', ‘mydc1’ :
> 1, ‘mydc2’ : 1 }
>
> Step 11. Run 'nodetool rebuild — mydc1’ in turn, in the new node.
>
> ———
>
>
> I'll run the procedure on the development envrionment and share it.
>
> Thank you.
>
>
>
>
> On 10 Sep 2018, at 10:26 PM, Pradeep Chhetri 
> wrote:
>
> Hello Eunsu,
>
> I am going through the same exercise at my job. I was making notes as i
> was testing the steps in my preproduction environment. Although I haven't
> tested end to end but hopefully this might help you:
> https://medium.com/p/465e9bf28d99
>
> Regards,
> Pradeep
>
> On Mon, Sep 10, 2018 at 5:59 PM, Alain RODRIGUEZ 
> wrote:
>
>> Adding a data center for the first time is a bit tricky when you
>> haven't been considering it from the start.
>>
>> I operate 5 nodes cluster (3.11.0) in a single data center with
>>> SimpleSnitch, SimpleStrategy and all client policy RoundRobin.
>>>
>>
>> You will need:
>>
>> - To change clients, m

Re: node replacement failed

2018-09-14 Thread Alain RODRIGUEZ
Hello

Also i could delete system_traces which is empty anyway, but there is a
> system_auth and system_distributed keyspace too and they are not empty,
> Could i delete them safely too?


I would say no, not safely, as I am not sure about some of them, but maybe
this would work. Here is what I know: 'system_traces' can be deleted.
'system_auth' should be ok to delete if you do not use auth (but I am not
100% sure), but 'system_distributed' looks like an important keyspace, and I
am really uncertain about this one. It was added in a 'recent' Cassandra
version and I am not too sure what's in there, to be honest.

It seems like quite a complex situation. A possible way out would be to do
the 3 node replacements without streaming and then repair (making sure no
client talks to or gets information from these nodes meanwhile), but I think
'replace' and 'auto_bootstrap: false' do not work well together.

Another idea is that 'nodetool removenode' on the 3 nodes, 1 by 1, should
also ensure consistency is preserved (if not forced). Then you could
recreate the 3 nodes of this rack. I am afraid this operation might fail as
well though, for the same reason of having ranges unavailable.
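
A rough sketch of that second idea, for illustration only (the host ID
placeholder comes from 'nodetool status'; whether 'force' is acceptable
depends on your consistency requirements):

```
# from any live node, for each dead node in turn
nodetool removenode <host-id-of-dead-node>
nodetool removenode status        # monitor progress
# last resort only, as it gives up on consistent re-replication:
# nodetool removenode force
```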

I am still thinking about it, but before going deeper, is this still an
issue for you at the moment?

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le lun. 10 sept. 2018 à 13:43, onmstester onmstester 
a écrit :

> Thanks Alain,
> First here it is more detail about my cluster:
>
>- 10 racks + 3 nodes on each rack
>- nodetool status: shows 27 nodes UN and 3 nodes all related to single
>rack as DN
>- version 3.11.2
>
> *Option 1: (Change schema and) use replace method (preferred method)*
> * Did you try to have the replace going, without any former repairs,
> ignoring the fact 'system_traces' might be inconsistent? You probably don't
> care about this table, so if Cassandra allows it with some of the nodes
> down, going this way is relatively safe probably. I really do not see what
> you could lose that matters in this table.
> * Another option, if the schema first change was accepted, is to make the
> second one, to drop this table. You can always rebuild it in case you need
> it I assume.
>
> I really love to let the replace going, but it stops with the error:
>
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
>
> Also i could delete system_traces which is empty anyway, but there is a
> system_auth and system_distributed keyspace too and they are not empty,
> Could i delete them safely too?
> If i could just somehow skip streaming the system keyspaces from node
> replace phase, the option 1 would be great.
>
> P.S: Its clear to me that i should use at least RF=3 in production, but
> could not manage to acquire enough resources yet (i hope would be fixed in
> recent future)
>
> Again Thank you for your time
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>  On Mon, 10 Sep 2018 16:20:10 +0430 *Alain RODRIGUEZ
> >* wrote 
>
> Hello,
>
> I am sorry it took us (the community) more than a day to answer to this
> rather critical situation. That being said, my recommendation at this point
> would be for you to make sure about the impacts of whatever you would try.
> Working on a broken cluster, as an emergency might lead you to a second
> mistake, possibly more destructive than the first one. It happened to me
> and around, for many clusters. Move forward even more carefuly in these
> situations as a global advice.
>
> Suddenly i lost all disks of cassandar-data on one of my racks
>
>
> With RF=2, I guess operations use LOCAL_ONE consistency, thus you should
> have all the data in the safe rack(s) with your configuration, you probably
> did not lose anything yet and have the service only using the nodes up,
> that got the right data.
>
>  tried to replace the nodes with same ip using this:
>
> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>
>
> As a side note, I would recommend you to use 'replace_address_first_boot'
> instead of 'replace_address'. This does basically the same but will be
> ignored after the first bootstrap. A detail, but hey, it's there and
> somewhat safer, I would use this one.
>
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
>
> By default, non-user keyspace use 'SimpleStrategy' and a small RF.
> Ideally, this should be changed in a production cluster, and you're having
> an example of why.
>
> Now when i altered the system_traces keys

Re: Read timeouts when performing rolling restart

2018-09-14 Thread Alain RODRIGUEZ
Hello Riccardo.

Going to VPC, using GPFS and NTS all sound very reasonable to me. As you
said, that's another story. Good luck with this. Yet the current topology
should also work well, and I am wondering why the query does not find any
other replica available.

About your problem at hand:

It's unclear to me at this point if the nodes are becoming unresponsive. My
main thought on your first email was that you were facing an issue where,
due to the topology or to the client configuration, you were missing
replicas, but I cannot see what's wrong (if not authentication indeed, but
you don't use it).
Then I am thinking it might indeed be due to many nodes getting extremely
busy at the moment of the restart (of any of the nodes), because of this:

After rising the compactors to 4 I still see some dropped messages for HINT
> and MUTATIONS. This happens during startup. Reason is "for internal
> timeout". Maybe too many compactors?


Some tuning information/hints:

* The number of *concurrent_compactors* should be between 1/4 and 1/2 of the
total number of cores and generally no more than 8. It should ideally never
be equal to the number of CPU cores, as we want power available to process
requests at any moment.
* Another common bottleneck is the disk throughput. If compactions are
running too fast, it can harm as well. I would fix the number of
concurrent_compactors as mentioned above and act on the compaction
throughput instead
* If hints are a problem, or rather saying, to make sure they are involved
in the issue you see, why not disabling hints completely on all nodes and
try a restart? Anything that can be disabled is an optimization. You do not
need hinted handoff if you run a repair later on (or if you operate with a
strong consistency and do not perform deletes for example). You can give
this a try:
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L44-L46
.
* Not as brutal, you can try slowing down the hint transfer speed (see the
sketch after this list):
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L57-L67
.
* Check for GC that would be induced by the pressure put by hints delivery,
compactions and the first load of the memory on machine start. Any GC
activity that would be shown in the logs?
* As you are using AWS, tuning phi_convict_threshold to 10-12 could also
help prevent the node from being marked down (if that's what happens).
* Do you see any specific part of the hardware being the bottleneck or being
especially heavily used during a restart? Maybe use:
'dstat -D  -lvrn 10' (where  is like 'xvdq'). I believe this
command shows Bytes, not bits, thus ‘50M' is 50 MB or 400 Mb.
* What hardware are you using?
* Could you also run a 'watch -n1 -d "nodetool tpstats"' during the node
restarts and see which thread pools are 'PENDING' during the restart? For
instance, if the flush writer is pending, the next write to this table has
to wait for the data to be flushed. It can be multiple things, but having
an interactive view of the pending requests might lead you to the root
cause of the issue.
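
For the hint-related suggestions above, a minimal sketch of the relevant
cassandra.yaml knobs (the values are illustrative assumptions, not
recommendations):

```
# disable hints entirely for a test restart...
hinted_handoff_enabled: false

# ...or, less brutal, throttle their delivery instead
# hinted_handoff_throttle_in_kb: 512   # default is 1024
# max_hints_delivery_threads: 1        # default is 2
```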

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 13 sept. 2018 à 09:50, Riccardo Ferrari  a
écrit :

> Hi Shalom,
>
> It happens almost at every restart, either a single node or a rolling one.
> I do agree with you that it is good, at least on my setup, to wait few
> minutes to let the rebooted node to cool down before moving to the next.
> The more I look at it the more I think is something coming from hint
> dispatching, maybe I should try  something around hints throttling.
>
> Thanks!
>
> On Thu, Sep 13, 2018 at 8:55 AM, shalom sagges 
> wrote:
>
>> Hi Riccardo,
>>
>> Does this issue occur when performing a single restart or after several
>> restarts during a rolling restart (as mentioned in your original post)?
>> We have a cluster that when performing a rolling restart, we prefer to
>> wait ~10-15 minutes between each restart because we see an increase of GC
>> for a few minutes.
>> If we keep restarting the nodes quickly one after the other, the
>> applications experience timeouts (probably due to GC and hints).
>>
>> Hope this helps!
>>
>> On Thu, Sep 13, 2018 at 2:20 AM Riccardo Ferrari 
>> wrote:
>>
>>> A little update on the progress.
>>>
>>> First:
>>> Thank you Thomas. I checked the code in the patch and briefly skimmed
>>> through the 3.0.6 code. Yup it should be fixed.
>>> Thank you Surbhi. At the moment we don't need authentication as the
>>> instances are locked down.
>>>
>>> Now:
>>> - Unfortunately the start_transport_native trick does not always work.
>>> On some nodes works on other don't.

Re: Anticompaction causing significant increase in disk usage

2018-09-12 Thread Alain RODRIGUEZ
Hello Martin,

How do you perform the repairs?

Are you using incremental repairs or full repairs but without subranges?
Alex described issues related to these repairs here:
http://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
.

*tl;dr: *

The only way to perform repair without anticompaction in “modern” versions
> of Apache Cassandra is subrange repair, which fully skips anticompaction.
> To perform a subrange repair correctly, you have three options :
> - Compute valid token subranges yourself and script repairs accordingly
> - Use the Cassandra range repair script which performs subrange repair
> - Use Cassandra Reaper, which also performs subrange repair


If you can prevent anti-compaction, disk space growth should be more
predictable.
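
As a hedged illustration of the subrange option, a single step could look
like the command below - the token boundaries, keyspace and table are
placeholders, and in practice tools like Reaper or the range repair script
compute and iterate over the subranges for you:

```
# full repair of one token subrange only - no anticompaction is triggered
nodetool repair -full -st -9223372036854775808 -et -6917529027641081856 my_keyspace my_table
```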

There might be more solutions out there by now; C* should also soon ship
with a sidecar - it's being actively discussed. Finally, incremental repairs
will receive important fixes in Cassandra 4.0; Alex wrote about this
too (yes, this guy loves repairs ¯\_(ツ)_/¯):
http://thelastpickle.com/blog/2018/09/10/incremental-repair-improvements-in-cassandra-4.html

I believe (and hope) this information is relevant to help you fix this
issue.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mer. 12 sept. 2018 à 10:14, Martin Mačura  a écrit :

> Hi,
> we're on cassandra 3.11.2 . During an anticompaction after repair,
> TotalDiskSpaceUsed value of one table gradually went from 700GB to
> 1180GB, and then suddenly jumped back to 700GB.  This happened on all
> nodes involved in the repair. There was no change in PercentRepaired
> during or after this process. SSTable count is currently 857, with a
> peak of 2460 during the repair.
>
> Table is using TWCS with 1-day time window.  Most daily SSTables are
> around 1 GB but the oldest one is 156 GB - caused by a major
> compaction.
>
> system.log.6.zip:INFO  [CompactionExecutor:9923] 2018-09-10
> 15:29:54,238 CompactionManager.java:649 - [repair
> #88c36e30-b4cb-11e8-bebe-cd3efd73ed33] Starting anticompaction for ...
> on 519 [...]  SSTables
> ...
> system.log:INFO  [CompactionExecutor:9923] 2018-09-12 00:29:39,262
> CompactionManager.java:1524 - Anticompaction completed successfully,
> anticompacted from 0 to 518 sstable(s).
>
> What could be the cause of the temporary increase, and how can we
> prevent it?  We are concerned about running out of disk space soon.
>
> Thanks for any help
>
> Martin
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Read timeouts when performing rolling restart

2018-09-12 Thread Alain RODRIGUEZ
Hello Riccardo,

How come that a single node is impacting the whole cluster?
>

It sounds weird indeed.

Is there a way to further delay the native transport startup?


You can configure 'start_native_transport: false' in 'cassandra.yaml'. (
https://github.com/apache/cassandra/blob/cassandra-3.0.6/conf/cassandra.yaml#L496
)
Then 'nodetool enablebinary' (
http://cassandra.apache.org/doc/latest/tools/nodetool/enablebinary.html)
when you are ready for it.

But I would consider this as a workaround, and it might not even work, I
hope it does though :).

Any hint on troubleshooting it further?
>

The version of Cassandra is quite an early Cassandra 3+. It's probably
worth considering a move to 3.0.17, if not to solve this issue, then to avoid
other issues that were fixed since then.
To know if that would really help you, you can go through
https://github.com/apache/cassandra/blob/cassandra-3.0.17/CHANGES.txt

I am not too sure about what is going on, but here are some other things I
would look at to try to understand this:

Are all the nodes agreeing on the schema?
'nodetool describecluster'

Are all the keyspaces using the 'NetworkTopologyStrategy' and a replication
factor of 2+?
'cqlsh -e "DESCRIBE KEYSPACES;" '

What snitch are you using (in cassandra.yaml)?

What does ownership look like?
'nodetool status '

What about gossip?
'nodetool gossipinfo' or 'nodetool gossipinfo | grep STATUS' maybe.

A tombstone issue?
https://support.datastax.com/hc/en-us/articles/204612559-ReadTimeoutException-seen-when-using-the-java-driver-caused-by-excessive-tombstones

Any ERROR or WARN in the logs after the restart on this node and on other
nodes (you would see the tombstone issue here)?
'grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log'

I hope one of those will help, let us know if you need help to interpret
some of the outputs,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


Le mer. 12 sept. 2018 à 10:59, Riccardo Ferrari  a
écrit :

> Hi list,
>
> We are seeing the following behaviour when performing a rolling restart:
>
> On the node I need to restart:
> *  I run the 'nodetool drain'
> * Then 'service cassandra restart'
>
> so far so good. The load incerase on the other 5 nodes is negligible.
> The node is generally out of service just for the time of the restart (ie.
> cassandra.yml update)
>
> When the node comes back up and switch on the native transport I start see
> lots of read timeouts in our various services:
>
> com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra
> timeout during read query at consistency LOCAL_ONE (1 responses were
> required but only 0 replica responded)
>
> Indeed the restarting node have a huge peak on the system load, because of
> hints and compactions, nevertheless I don't notice a load increase on the
> other 5 nodes.
>
> Specs:
> 6 nodes cluster on Cassandra 3.0.6
> - keyspace RF=3
>
> Java driver 3.5.1:
> - DefaultRetryPolicy
> - default LoadBalancingPolicy (that should be DCAwareRoundRobinPolicy)
>
> QUESTIONS:
> How come that a single node is impacting the whole cluster?
> Is there a way to further delay the native transport startup?
> Any hint on troubleshooting it further?
>
> Thanks
>


Re: Default Single DataCenter -> Multi DataCenter

2018-09-10 Thread Alain RODRIGUEZ
Adding a data center for the first time is a bit tricky when you
haven't been considering it from the start.

I operate 5 nodes cluster (3.11.0) in a single data center with
> SimpleSnitch, SimpleStrategy and all client policy RoundRobin.
>

You will need:

- To change clients, make them 'DCAware'. This depends on the client, but
you should be able to find this in your Cassandra driver (client side).
- To change clients, make them use 'LOCAL_' consistency
('LOCAL_ONE'/'LOCAL_QUORUM' being the most common).
- To change 'SimpleSnitch' for 'EC2Snitch' or 'GossipingPropertyFileSnitch'
for example, depending on your context/preference
- To change 'SimpleStrategy' for 'NetworkTopologyStrategy' for all the
keyspaces, with the desired RF (see the sketch after this list). I take the
chance to say that switching to 1 replica only is often a mistake: you can
indeed have data loss (which you accept) but also the service going down any
time you restart a node or a node goes down. If you are ok with RF=1, an
RDBMS might be a better choice. It's an anti-pattern of some kind to run
Cassandra with RF=1. Yet it's up to you, and this is not our topic :). In
the same kind of off-topic recommendations, I would not stick with C*3.11.0,
but go to C*3.11.3 (if you do not perform slice deletes; there is apparently
still a bug with those).
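
A hedged sketch of that last change (the keyspace name, datacenter name and
RF are placeholders - list every DC that should hold replicas, with its
desired RF, and repeat for each keyspace):

```
ALTER KEYSPACE my_keyspace
  WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1': 2};
```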

So this all needs to be done *before* starting to add the new data center.
Changing the snitch is tricky: make sure that the new snitch uses the rack
and dc names currently in use for the existing cluster; if not, the data
could become inaccessible after the configuration change.

Then, the procedure to add a data center is probably described elsewhere. I
know I wrote this detailed description in 2014, here it is:
https://mail-archives.apache.org/mod_mbox/cassandra-user/201406.mbox/%3cca+vsrlopop7th8nx20aoz3as75g2jrjm3ryx119deklynhq...@mail.gmail.com%3E,
but you might find better/more recent documentation than this one for this
relatively common process, like the documentation you linked.

If you are not confident or have doubts, you can share more about the
context and post your exact plan, as I did years ago in the mail previously
linked. People here should be able to confirm the process is ok before you
move forward, giving you an extra confidence.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 10 sept. 2018 à 11:05, Eunsu Kim  a écrit :

> Hello everyone
>
> I operate 5 nodes cluster (3.11.0) in a single data center with
> SimpleSnitch, SimpleStrategy and all client policy RoundRobin.
>
> At this point, I am going to create clusters of the same size in different
> data centers.
>
> I think these two documents are appropriate, but there is confusion
> because they are referenced to each other.
>
>
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsAddDCToCluster.html
>
> https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsSwitchSnitch.html
>
> Anyone who can clearly guide the order? Currently RF is 2 and I want to
> have only one replica in the NetworkTopologyStrategy.
> A little data loss is okay.
>
> Thank you in advanced..
>
>
>
>
>
>
>


Re: node replacement failed

2018-09-10 Thread Alain RODRIGUEZ
nt data the time for the
nodes to be repaired.

I must say that I really prefer odd values for the RF, starting with RF=3.
Using RF=2 you will have to pick: consistency or availability. With a
consistency of ONE everywhere, the service is available, with no single
point of failure. Using anything bigger than this, for writes or reads,
brings consistency but creates single points of failure (actually any node
becomes a point of failure). RF=3 and QUORUM for both writes and reads
takes the best of the 2 worlds somehow. The tradeoff with RF=3 and quorum
reads is the increase in latency and resource usage.
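
A quick worked illustration of that trade-off (a sketch added for clarity,
not from the original message):

```
# QUORUM = floor(RF/2) + 1; reads and writes overlap on at least one replica when R + W > RF
def quorum(rf):
    return rf // 2 + 1

for rf in (2, 3):
    q = quorum(rf)
    print(f"RF={rf}: QUORUM={q}, R+W={2 * q} > RF -> consistent, "
          f"tolerates {rf - q} node(s) down for quorum operations")
# RF=2 -> QUORUM=2: consistent, but any single node down blocks quorum reads/writes
# RF=3 -> QUORUM=2: consistent, and one node can be down at any time
```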

Maybe there is a better approach, I am not too sure, but I think I would
try option 1 first in any case. It's less destructive, less risky: no token
range movements, no empty nodes available. I am not sure about the
limitations you might face though, and that's why I suggest a second option
for you to consider if the first is not actionable.

Let us know how it goes,
C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le lun. 10 sept. 2018 à 09:09, onmstester onmstester 
a écrit :

> Any idea?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>  On Sun, 09 Sep 2018 11:23:17 +0430 *onmstester onmstester
> >* wrote 
>
>
> Hi,
>
> Cluster Spec:
> 30 nodes
> RF = 2
> NetworkTopologyStrategy
> GossipingPropertyFileSnitch + rack aware
>
> Suddenly i lost all disks of cassandar-data on one of my racks, after
> replacing the disks, tried to replace the nodes with same ip using this:
>
> https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
>
> starting the to-be-replace-node fails with:
> java.lang.IllegalStateException: unable to find sufficient sources for
> streaming range in keyspace system_traces
>
> the problem is that i did not changed default replication config for
> System keyspaces, but Now when i altered the system_traces keyspace
> startegy to NetworkTopologyStrategy and RF=2
> but then running nodetool repair failed: Endpoint not alive /IP of dead
> node that i'm trying to replace.
>
> What should i do now?
> Can i just remove previous nodes, change dead nodes IPs and re-join them
> to cluster?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>
>


Re: nodetool cleanup - compaction remaining time

2018-09-06 Thread Alain RODRIGUEZ
>
> As far as I can remember, if you have unthrottled compaction, then the
> message is different: it says "n/a".


Ah right!

I am now completely convinced this needs a JIRA as well (indeed, if it's
not fixed in C*3+, as Jeff mentioned).
Thanks for the feedback Alex.
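
For context, a rough sketch of how the estimate behaves, based on the
CompactionStats logic Alex points at below (an illustration, not the actual
Cassandra code):

```
# only COMPACTION-type tasks count toward remaining_bytes, so a node running only
# Cleanup/Validation reports 0 bytes remaining -> "0h00m00s"
def remaining_time_seconds(remaining_bytes, throughput_bytes_per_s):
    if throughput_bytes_per_s == 0:   # unthrottled -> reported as "n/a"
        return None
    return remaining_bytes / throughput_bytes_per_s
```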

Le jeu. 6 sept. 2018 à 11:06, Oleksandr Shulgin <
oleksandr.shul...@zalando.de> a écrit :

> On Thu, Sep 6, 2018 at 11:50 AM Alain RODRIGUEZ 
> wrote:
>
>>
>> Be aware that this behavior happens when the compaction throughput is set
>> to *0 *(unthrottled/unlimited). I believe the estimate uses the speed
>> limit for calculation (which is often very much wrong anyway).
>>
>
> As far as I can remember, if you have unthrottled compaction, then the
> message is different: it says "n/a".  The all zeroes you usually see when
> you only have Validation compactions, and apparently Cleanup works the same
> way, at least in the 2.1 version.
>
>
> https://github.com/apache/cassandra/blob/06209037ea56b5a2a49615a99f1542d6ea1b2947/src/java/org/apache/cassandra/tools/nodetool/CompactionStats.java#L102
>
> Actually, if you look closely, it's obvious that only real Compaction
> tasks count toward remainingBytes, so all Validation/Cleanup/Upgrade don't
> count.  The reason must be that only actual compaction is affected by the
> throttling parameter.  Is that assumption correct?
>
> In any case it would make more sense to measure the actual throughput to
> provide an accurate estimate.  Not sure if there is JIRA issue for that
> already.
>
> --
> Alex
>
>


Re: nodetool cleanup - compaction remaining time

2018-09-06 Thread Alain RODRIGUEZ
Hello Thomas.

Be aware that this behavior happens when the compaction throughput is set
to *0 *(unthrottled/unlimited). I believe the estimate uses the speed limit
for calculation (which is often very much wrong anyway).

I just meant to say, you might want to make sure that it's due to cleanup
type of compaction indeed and not due to some changes you could have made
in the compaction throughput threshold.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le jeu. 6 sept. 2018 à 06:51, Jeff Jirsa  a écrit :

> Probably worth a JIRA (especially if you can repro in 3.0 or higher, since
> 2.1 is critical fixes only)
>
> On Wed, Sep 5, 2018 at 10:46 PM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>> Hello,
>>
>>
>>
>> is it a known issue / limitation that cleanup compactions aren’t counted
>> in the compaction remaining time?
>>
>>
>>
>> nodetool compactionstats -H
>>
>>
>>
>> pending tasks: 1
>>
>>compaction type   keyspace   table   completed total
>> unit   progress
>>
>>CleanupXXX YYY   908.16 GB   1.13 TB
>> bytes 78.63%
>>
>> Active compaction remaining time :   0h00m00s
>>
>>
>>
>>
>>
>> This is with 2.1.18.
>>
>>
>>
>>
>>
>> Thanks,
>>
>> Thomas
>>
>>
>>
>


Re: commitlog content

2018-08-30 Thread Alain RODRIGUEZ
Hello Vitaly.

This sounds weird to me (unless we are speaking about a small size, a
few MB or GB maybe). The commit log size is limited by default (see below)
and the data should grow bigger in most cases.

According to the documentation (
http://cassandra.apache.org/doc/latest/architecture/storage_engine.html#commitlog
):

commitlog_total_space_in_mb: Total space to use for commit logs on disk.
> If space gets above this value, Cassandra will flush every dirty CF in the
> oldest segment and remove it. So a small total commitlog space will tend to
> cause more flush activity on less-active columnfamilies.
> The default value is the smaller of 8192, and 1/4 of the total space of
> the commitlog volume.
> Default Value: 8192


The commit log is supposed to be cleaned on flush, thus there are multiple
solutions to reduce the disk space used by commit logs:
- Decrease the value for 'commitlog_total_space_in_mb' (probably the best
option, you say what you want, and you get it)
- Use the table option 'memtable_flush_period_in_ms' (default is 0, pick
what you would like here - has to be done on all the tables you want it to
apply to)
- Manually run: 'nodetool flush' should also clean the commit logs
- Reduce the size of the memtables
- Limit the maximum size per table before a flush is triggered with
'memtable_cleanup_threshold'. According to the doc it's not a good idea
though (
http://cassandra.apache.org/doc/latest/configuration/cassandra_config_file.html#memtable-cleanup-threshold
).
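
As an illustration of the first options (values and table names are
arbitrary placeholders, adapt them to your workload):

```
# cassandra.yaml (needs a restart): cap the total commit log space (default 8192 MB)
commitlog_total_space_in_mb: 2048

# one-off, per node: flush all memtables so commit log segments can be recycled
nodetool flush
```

```
-- per table (CQL): flush the memtable at most every hour (default 0 = disabled)
ALTER TABLE my_keyspace.my_table WITH memtable_flush_period_in_ms = 3600000;
```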

Also, the data in Cassandra is compacted and compressed. Over a short test
period, or if the data is small compared to the memory available and
fits mostly in memory, I can imagine that what you describe can happen.

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

Le mar. 28 août 2018 à 18:24, Vitaliy Semochkin  a
écrit :

> Hello,
>
> I've noticed that after a stress test that does only inserts a
> commitlog content exceeds data dir 20 times.
> What can be cause of such behavior?
>
> Running nodetool compact did not change anything.
>
> Regards,
> Vitaliy
>
>
>


Re: dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-07 Thread Alain RODRIGUEZ
Hello Kyrill,

But in case of CL=QUORUM/LOCAL_QUORUM, if I'm not wrong, read request is
> sent to all replicas waiting for first 2 to reply.
>

My understanding is that this sentence is wrong. It is as you described it
for writes indeed: all the replicas get the information (and in all the
data centers). It's not the case for reads. For reads, x nodes are picked
and used (x depending on the consistency level: ONE, QUORUM, ALL, ...).

Looks like the only change for dynamic_snitch=false is that "data" request
> is sent to a determined node instead of "currently the fastest one".
>

Indeed, the problem is that the 'currently the fastest one' changes very
often in certain cases, thus removing the efficiency from the cache without
enough compensation in many cases.
The idea of not using the 'bad' nodes is interesting to have more
predictable latencies when a node is slow for some reason. Yet one of the
side effects of this (and of the scoring that does not seem to be
absolutely reliable) is that the clients are often routed to distinct nodes
when under pressure, due to GC pauses for example or any other pressure.
Saving disk reads in read-heavy workloads under pressure is more important
than trying to save a few milliseconds picking the 'best' node I guess.
I can imagine that alleviating these disks, reducing the number of disk
IO/throughput ends up lowering the latency for all the nodes, thus the
client application latency improves overall. That is my understanding of
why it is so often good to disable the dynamic_snitch.

Did you get improved response for CL=ONE only or for higher CL's as well?
>

I must admit I don't remember for sure, but many people are using
'LOCAL_QUORUM' and I think I saw this for this consistency level as well.
Plus this question might no longer stand as reads in Cassandra work
slightly differently than what you thought.

I am not 100% comfortable with this 'dynamic_snitch theory' topic, so I
hope someone else can correct me if I am wrong, confirm or add information
:). But for sure I have seen this disabled giving some really nice
improvement (as many others here as you mentioned). Sometimes it was not
helpful, but I have never seen this change being really harmful though.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-08-06 22:27 GMT+01:00 Kyrylo Lebediev :

> Thank you for replying, Alain!
>
>
> Better use of cache for 'pinned' requests explains good the case when
> CL=ONE.
>
>
> But in case of CL=QUORUM/LOCAL_QUORUM, if I'm not wrong, read request is
> sent to all replicas waiting for first 2 to reply.
>
> When dynamic snitching is turned on, "data" request is sent to "the
> fastest replica", and "digest" requests - to the rest of replicas.
>
> But anyway digest is the same read operation [from SSTables through
> filesystem cache] + calculating and sending hash to coordinator. Looks like
> the only change for dynamic_snitch=false is that "data" request is sent to
> a determined node instead of "currently the fastest one".
>
> So, if there are no mistakes in above description, improvement shouldn't
> be much visible for CL=*QUORUM...
>
>
> Did you get improved response for CL=ONE only or for higher CL's as well?
>
>
> Indeed an interesting thread in Jira.
>
>
> Thanks,
>
> Kyrill
> --
> *From:* Alain RODRIGUEZ 
> *Sent:* Monday, August 6, 2018 8:26:43 PM
> *To:* user cassandra.apache.org
> *Subject:* Re: dynamic_snitch=false, prioritisation/order or reads from
> replicas
>
> Hello,
>
>
> There are reports (in this ML too) that disabling dynamic snitching
> decreases response time.
>
>
> I confirm that I have seen this improvement on clusters under pressure.
>
> What effects stand behind this improvement?
>
>
> My understanding is that this is due to the fact that the clients are then
> 'pinned', more sticking to specific nodes when the dynamic snitching is
> off. I guess there is a better use of caches and in-memory structures,
> reducing the amount of disk read needed, which can lead to way more
> performances than switching from node to node as soon as the score of some
> node is not good enough.
> I am also not sure that the score calculation is always relevant, thus
> increasing the threshold before switching reads to another node is still
> often worse than disabling it completely. I am not sure if the score
> calculation was fixed, but in most cases, I think it's safer to run with
> 'dynamic_snitch: false'. Anyway, it's possible to test it on a canary node
> (or entire rack) and look at the p99 for read latencies for example :).
>
> This ticket is old, but was precisely on

Re: dynamic_snitch=false, prioritisation/order or reads from replicas

2018-08-06 Thread Alain RODRIGUEZ
Hello,


> There are reports (in this ML too) that disabling dynamic snitching
> decreases response time.


I confirm that I have seen this improvement on clusters under pressure.

What effects stand behind this improvement?
>

My understanding is that this is due to the fact that the clients are then
'pinned', more sticking to specific nodes when the dynamic snitching is
off. I guess there is a better use of caches and in-memory structures,
reducing the amount of disk read needed, which can lead to way more
performances than switching from node to node as soon as the score of some
node is not good enough.
I am also not sure that the score calculation is always relevant, thus
increasing the threshold before switching reads to another node is still
often worse than disabling it completely. I am not sure if the score
calculation was fixed, but in most cases, I think it's safer to run with
'dynamic_snitch: false'. Anyway, it's possible to test it on a canary node
(or entire rack) and look at the p99 for read latencies for example :).
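
For reference, it is a single setting in cassandra.yaml (a restart of the
node is needed for it to be taken into account):

```
# cassandra.yaml, on the canary node only
dynamic_snitch: false
```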

This ticket is old, but was precisely on that topic:
https://issues.apache.org/jira/browse/CASSANDRA-6908

C*heers
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-08-04 15:37 GMT+02:00 Kyrylo Lebediev :

> Hello!
>
>
> In case when dynamic snitching is enabled data is read from 'the fastest
> replica' and other replicas send digests for CL=QUORUM/LOCAL_QUORUM .
>
> When dynamic snitching is disabled, as the concept of the fastest replica
> disappears, which rules are used to choose from which replica to read
> actual data (not digests):
>
>  1) when all replicas are online
>
>  2) when the node primarily responsible for the token range is offline.
>
>
> There are reports (in this ML too) that disabling dynamic snitching
> decreases response time.
>
> What effects stand behind this improvement?
>
>
> Regards,
>
> Kyrill
>


Re: concurrent_compactors via JMX

2018-07-24 Thread Alain RODRIGUEZ
Hello Ricardo,

My understanding is that GP2 is better. I think we did some testing in the
past, but I must say I do not remember the exact results. I remember we
also thought of IO1 at some point, but we were not convinced by this kind
of EBS (not sure if it was not as performant as suggested in the doc or
just much more expensive). Maybe test it and form your own opinion, or wait for
someone else's information.

Be aware that the size of a GP2 EBS volume impacts its IOPS; the max IOPS
is reached at ~3.334 TB, which is also a good dataset size for Cassandra
(1.5/2 TB with some spare space for compactions).

I'd like to deploy on i3.xlarge
>

Yet if you go for I3, of course, use the ephemeral drives (NVMe). It's
incredibly fast ;-). Compared with m1.xlarge you should see a
substantial difference. The problem is that with a low number of nodes, it
will always cost more to have i3 than m1. This is often not the case with
more machines, as each node will work way more efficiently and you can
effectively reduce the number of nodes. Here, 3 will probably be the
minimum number of nodes and 3 x i3 might cost more than 5/6 x m1 instances. When
scaling up though, you should come back to an acceptable cost/efficiency.
It's your call whether to continue with m1, m5 or r4 instances meanwhile.

I decided to get safe and scale horizontally with the hardware we have
> tested
>

Yes, this is fair enough and a safe approach. To add new hardware the best
approach is a data center switch (I will write a post about how to do this
sometime soon)

I'm preparing to migrate inside vpc
>

This too is probably best done through a DC switch. I remembered I asked for help on
this in 2014; I found the reference for you where I published the steps
that I went through to go from EC2 public --> public VPC --> private VPC.
It's old and I did not read it again, but it worked for us at that
time. I hope you might find it useful, as the process is detailed
step by step. It should be easy to adapt it and you should not forget any
step this way:
http://grokbase.com/t/cassandra/user/1465m9txtw/vpc-aws#20140612k7xq0t280cvyk6waeytxbkx40c


possibly in Multi-AZ.
>

Yes, I recommend you do this. It's incredibly powerful when you know that
with 3 racks and RF=3 (and proper topology/configuration), each rack owns
100% of the data. Thus when operating you can work on one rack at a time with
limited risk; even using quorum, the service should stay up, no matter what
happens, as long as 2 AZs are completely available. When the cluster grows
you will really appreciate this to prevent some failures and operate
safely.

PS: I definitely owe you a coffee, actually much more than that!


If we meet we can definitely share a beer (no coffee for me, but I never
say no to a beer ;-)).
But you don't owe me anything; it was and still is free. Here we all share,
period. I like to think that knowledge is the only wealth you can give away
while keeping it for yourself. Some even say that knowledge grows when
shared. I used this mailing list myself to ramp up with Cassandra, and I have
probably been paying back to the community somehow for years now :-). Now
it's even part of my job, this is a part of what we do :). And I like it.
What I invite you to do is help people around you when you are
comfortable with some topics. This way someone else might enjoy this
mailing list, making it a nicer place and contributing to growing the
community ;-).

Yet, be assured I appreciate the feedback and that you are grateful; it
shows it was somehow useful to you. This is enough for me.

C*heers
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 19:21 GMT+02:00 Riccardo Ferrari :

> Alain,
>
> I really appreciate your answers! A little typo is not changing the
> valuable content! For sure I will give a shot to your GC settings and come
> back with my findings.
> Right now I have 6 nodes up and running and everything looks good so far
> (at least much better).
>
> I agree, the hardware I am using is quite old but rather than experimenting
> with new hardware combinations (on prod) I decided to play it safe and scale
> horizontally with the hardware we have tested. I'm preparing to migrate
> inside vpc and I'd like to deploy on i3.xlarge instances and possibly in
> Multi-AZ.
>
> Speaking of EBS: I gave a quick I/O test to m3.xlarge + SSD + EBS (400
> PIOPS). SSD looks great for commitlogs, EBS I might need more guidance. I
> certainly gain in terms of random i/o however I'd like to hear what is your
> stand wrt IO2 (PIOPS) vs regular GP2? Or better: what are you guidelines
> when using EBS?
>
> Thanks!
>
> PS: I definitely owe you a coffee, actually much more than that!
>
> On Thu, Jul 19, 2018 at 6:24 PM, Alain RODRIGUEZ 
> wrote:
>
>> Ah excuse my confusio

Re: concurrent_compactors via JMX

2018-07-19 Thread Alain RODRIGUEZ
> Ah excuse my confusion. I now understand I guide you through changing the
> throughput when you wanted to change the compaction throughput.



Wow, I meant to say "I guided you through changing the compaction
throughput when you wanted to change the number of concurrent compactors."

I should not answer messages before waking up fully...

:)

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-19 14:07 GMT+01:00 Alain RODRIGUEZ :

> Ah excuse my confusion. I now understand I guide you through changing the
> throughput when you wanted to change the compaction throughput.
>
> I also found some commands I ran in the past using jmxterm. As mentioned
> by Chris - and thanks Chris for answering the question properly -, the
> 'max' can never be lower than the 'core'.
>
> Use JMXTERM to REDUCE the concurrent compactors:
>
> ```
> # if we have more than 2 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager
> CoreCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l
> 127.0.0.1:7199 && echo "set -b org.apache.cassandra.db:type=CompactionManager
> MaximumCompactorThreads 2" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l
> 127.0.0.1:7199
> ```
>
> Use JMXTERM to INCREASE the concurrent compactors:
>
> ```
> # if we have currently less than 6 threads:
> echo "set -b org.apache.cassandra.db:type=CompactionManager
> MaximumCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l
> 127.0.0.1:7199 && echo "set -b org.apache.cassandra.db:type=CompactionManager
> CoreCompactorThreads 6" | java -jar /opt/tlp/jmxterm-1.0.0-uber.jar -l
> 127.0.0.1:7199
> ```
>
> Some comments about the information you shared, as you said, 'thinking out
> loud' :):
>
> *About the hardware*
>
> I remember using the 'm1.xlarge' :). They are not that recent. It will
> probably be worth it to reconsider this hardware choice and migrate to newer
> hardware (m5/r4 + EBS GP2 or I3 with ephemeral). You should be able to
> reduce the number of nodes and make it equivalent (or maybe slightly more
> expensive but so it works properly). I once moved from a lot of these nodes
> (80ish) to a few I2 instances (5 - 15? I don't remember). Latency went from
> 20 ms to 3 - 5 ms (and was improved later on). Also, using the right
> hardware for your case should avoid headaches to you and your team. I
> started with t1.micro in prod and went all the way up (m1.small, m1.medium,
> ...). It's good for learning, not for business.
>
> Especially, this does not work well together:
>
> my instances are still on magnetic drives
>>
>
> with
>
> most tables on LCS
>
> frequent r/w pattern
>>
>
> Having some SSDs here (EBS GP2 or even better I3 - NVMe disks) would most
> probably help to reduce the latency. I would also pick an instance with
> more memory (30 GB would probably be more comfortable). The more memory,
> the better it is possible to tune the JVM and the more page caching can be
> done (thus avoiding some disk reads). Given the number of nodes you use,
> it's complex to keep the cost low doing this change. When the cluster will
> grow you might want to consider changing the instance type again and maybe
> for now just take a r4.xlarge + EBS Volume GP2. It comes with 30+ GB of
> memory and the same number of cpu (or more) and see how many nodes are
> needed. It might be slightly more expensive, but I really believe it could
> do  some good.
>
> As a middle term solution, I think you might be really happy with a change
> of this kind.
>
> *About DTCS/TWCS?*
>
>
>>
>> * - few tables with DTCS- need to upgrade to 3.0.8 for TWCS*
>
> Indeed switching from DTCS to TWCS can be a real relief for a
> cluster. You should not have to wait to upgrade to 3.0.8 to use TWCS. I
> must say I am not too sure for 3.0.x (x < 8) versions though. Maybe giving
> a try to http://thelastpickle.com/blog/2017/01/10/twcs-part2.html with
> https://github.com/jeffjirsa/twcs/tree/cassandra-3.0.0 is easier for you?
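>
> For what it's worth, once on 3.0.8+ (or with the external jar, using its
> class name instead of the built-in one), the switch is only a compaction
> option change on the table, something like this (the table name is a
> placeholder, window settings depend on your TTLs and query patterns):
>
> ```
> ALTER TABLE my_keyspace.my_timeseries WITH compaction = {
>   'class': 'TimeWindowCompactionStrategy',
>   'compaction_window_unit': 'DAYS',
>   'compaction_window_size': 1
> };
> ```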
>
> *Garbage Collection?*
>
> That being said, the CPU load is really high; I suspect Garbage Collection
> is taking a lot of time on the nodes of this cluster. It is probably not
> helping the CPUs either. This might even be the biggest pain point for this
> cluster.
>
> Would you like to try using following settings on a canary node and see
> how it goes? These settings are quite arbitrary. With the gc.log I could be
> more precise on what I believe is a correct setting.
>
> GC Type: CMS
> Heap: 8 GB (could be bigger, 

Re: concurrent_compactors via JMX

2018-07-19 Thread Alain RODRIGUEZ
on. SurvivorRatio
can also be moved down to 2 or 4, if you want to play around and check the
difference.

Make sure to use a canary node first, there is no 'good' configuration
here, it really depends on the workload and the settings above could harm
the cluster.

I think we can make more of these instances. Nonetheless after adding a few
more nodes, scaling up the instance type instead of the number of nodes to
have SSDs and a bit more memory will make things smoother, and probably
cheaper as well at some point.




2018-07-18 17:27 GMT+01:00 Riccardo Ferrari :

> Chris,
>
> Thank you for mbean reference.
>
> On Wed, Jul 18, 2018 at 6:26 PM, Riccardo Ferrari 
> wrote:
>
>> Alain, thank you for email. I really really appreciate it!
>>
>> I am actually trying to remove the disk io from the suspect list, thus
>> I'm want to reduce the number of concurrent compactors. I'll give
>> thorughput a shot.
>> No, I don't have a long list of pending compactions, however my instances
>> are still on magnetic drives and can't really afford a high number of
>> compactors.
>>
>> We started to have slow downs and most likely we were undersized, new
>> features are coming in and I want to be ready for them.
>> *About the issue:*
>>
>>
>>- High system load on cassanda nodes. This means top saying 6.0/12.0
>>on a 4 vcpu instance (!)
>>
>>
>>- CPU is high:
>>  - Dynatrace says 50%
>>  - top easily goes to 80%
>>   - Network around 30Mb (according to Dynatrace)
>>   - Disks:
>>  - ~40 iops
>>  - high latency: ~20ms (min 8 max 50!)
>>  - negligible iowait
>>  - testing an empty instance with fio I get 1200 r_iops / 400
>>  w_iops
>>
>>
>>- Clients timeout
>>   - mostly when reading
>>   - few cases when writing
>>- Slowly growing number of "All time blocked of Native T-R"
>>   - small numbers: hundreds vs millions of successfully serverd
>>   requests
>>
>> The system:
>>
>>- Cassandra 3.0.6
>>   - most tables on LCS
>>  - frequent r/w pattern
>>   - few tables with DTCS
>>  - need to upgrade to 3.0.8 for TWCS
>>  - mostly TS data, stream write / batch read
>>   - All our keyspaces have RF: 3
>>
>>
>>- All nodes on the same AZ
>>- m1.xlarge
>>- 4x420 drives (emphemerial storage) configured in striping (raid0)
>>   - 4 vcpu
>>   - 15GB ram
>>- workload:
>>   - Java applications;
>>  - Mostly feeding cassandra writing data coming in
>>  - Apache Spark applications:
>>  - batch processes to read and write back to C* or other systems
>>  - not co-located
>>
>> So far my effort was put into growing the ring to better distribute the
>> load and decrease the pressure, including:
>>
>>- Increasing the node number from 3 to 5 (6th node joining)
>>- jvm memory "optimization":
>>- heaps were set by default script to something bit smaller that 4GB
>>   with CMS gc
>>   - gc pressure was high / long gc pauses
>>  - clients were suffering of read timeouts
>>   - increased the heap still using CMS:
>>  - very long GC pauses
>>  - not much tuning around CMS
>>  - switched to G1 and forced 6/7GB heap on each node using
>>   almost suggested settings
>>   - much more stable
>> - generally < 300ms
>>  - I still have long pauses from time to time (mostly around
>>  1200ms, sometimes on some nodes 3000)
>>
>> *Thinking out loud:*
>> Things are much better, however I still see a high cpu usage specially
>> when Spark kicks even though spark jobs are very small in terms of
>> resources (single worker with very limited parallelism).
>>
>> On LCS tables cfstats reports single digit read latencies and generally
>> 0.X write latencies (as per today).
>> On DTCS tables I have 0.x ms write latency but still double digit read
>> latency, but I guess I should spend some time to tune that or upgrade and
>> move away from DTCS :(
>> Yes, Saprk reads mostly from DTCS tables
>>
>> Still is kinda common to to have dropped READ, HINT and MUTATION.
>>
>>- not on all nodes
>>- this generally happen on node restart
>>
>>
>> On a side note I tried to install libjemalloc1 from Ubuntu repo (mixed
>> 14.04 and 16.04) with terrible results, much s

Re: apache cassandra development process and future

2018-07-18 Thread Alain RODRIGUEZ
Hello,

It's a complex topic that has already been extensively discussed (at least
for the part about Datastax). I am sharing my personal understanding, from
what I read in the mailing list mostly:

Recently Cassandra eco system became very fragmented
>

I would not put Scylladb in the same 'eco system' as Apache Cassandra. I
believe it is inspired by Cassandra and claims to be compatible with it up
to a certain point, but it's not the same software, thus not the same users
and community.

About Datastax, I think they will give you a better idea of their position
by themselves here or through their support. I believe they also
communicated about it already. But in any case, I see Datastax more in the
same 'eco system' than Scylladb. Datastax uses a patched/forked version of
Cassandra (+ some other tools integrated with Cassandra and support). Plus
it goes both ways, Datastax greatly contributed to making Cassandra what it
is now and relies on it (or used to do so at least). I don't think that's
the case for Scylladb; I don't see that much interest in
connection/exchanges with Scylladb, I mean no more than exchanging about
DynamoDB for example. We can make standards, compatible features, compare
performances, etc., but it's not the same code base.

Since Datastax used to be the major participant to Cassandra
> development and now it looks it goes on is own way, what is going to
> be with the Apache Cassandra?
>

Well, this is a fair point, that was discussed in the past, but to make it
short, Apache Cassandra is not dead or anything close. There is a lot of
activity. Some people are stepping out, others stepping in, and other
companies and individuals are actively contributing to Cassandra. A version
4.0 of Cassandra is being actively worked on at the moment. If these topics
are of interest, you might want to join the "Cassandra dev" mailing list (
http://cassandra.apache.org/community/).

If there are any other active participants in development?
>

Yes, directly or by open sourcing internal tools quite a few companies have
contributed and continue to contribute to the Apache Cassandra ecosystem. I
invite you to have a look directly at this dev mailing list and check
people's email, profiles or companies. Check the Jira as well :). I am not
into doing this kind of stuff that much myself, I am not following this
closely but I can name for sure Apple, Netflix, The Last Pickle (my
company), Instaclustr I believe as well and many others that I am sorry not
to name here.

Some people are working on Apache Cassandra for years and are around to
help regularly, they changed company but are still working on Cassandra, or
even changed company to work more with Apache Cassandra in some cases.

I'm also interested which distribution is the most popular at the
> moment in production?


I would say now you should start with C*3.0.last or C* 3.11.last. It seems
to be the general consensus in the mailing list lately.
For Scylladb and Datastax I don't know about the version to use. You should
ask them directly.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-18 12:39 GMT+01:00 Vitaliy Semochkin :

> Hi,
>
> Recently Cassandra eco system became very fragmented:
>
> Scylladb provides solution based on Cassandra wire protocol claiming
> it is 10 times faster than Cassandra.
>
> Datastax provides it's own solution called DSE claiming it is twice
> faster than Cassandra.
> Also their site says "DataStax no longer supports the DataStax
> Community version of Apache Cassandra™ or the DataStax Distribution of
> Apache Cassandra™.
> Is their new software incompatible with Cassandra?
> Since Datastax used to be the major participant to Cassandra
> development and now it looks it goes on is own way, what is going to
> be with the Apache Cassandra?
> If there are any other active participants in development?
>
> I'm also interested which distribution is the most popular at the
> moment in production?
>
> Best Regards,
> Vitaliy
>
>
>


Re: concurrent_compactors via JMX

2018-07-17 Thread Alain RODRIGUEZ
eresting:
http://thelastpickle.com/blog/2018/04/11/gc-tuning.html. GC tuning is a
very common way to optimize a Cassandra cluster, to adapt it to your
workload/hardware.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2018-07-17 17:23 GMT+01:00 Riccardo Ferrari :

> Hi list,
>
> Cassandra 3.0.6
>
> I'd like to test the change of concurrent compactors to see if it helps
> when the system is under stress.
>
> Can someone point me to the right mbean?
> I can not really find good docs about mbeans (or tools ...)
>
> Any suggestion much appreciated, best
>


Re: CPU Spike with Jmx_exporter

2018-07-10 Thread Alain RODRIGUEZ
Hello,

I have not worked with 'jmx_exporter' in a production cluster, but for the
datadog agent and other collectors I have worked with, the number of metrics
being collected was a key point.

Cassandra exposes a lot of metrics and I saw datadog agents taking too much
CPU; I even saw Graphite servers falling over because of the load due to some
Cassandra nodes sending metrics. I would recommend you make sure that
you are filtering in only metrics that are used to display some charts or
used for alerting purposes. Restrict the pattern for the rules as much as
possible.
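
For example, with jmx_exporter the filtering happens in its config file. A
sketch of the idea (bean names, patterns and keys are only an illustration,
and the exact syntax depends on the jmx_exporter version, so check its
documentation):

```
lowercaseOutputName: true
# only query the MBeans you actually chart or alert on
whitelistObjectNames:
  - "org.apache.cassandra.metrics:type=ClientRequest,*"
  - "org.apache.cassandra.metrics:type=Table,keyspace=my_ks,*"
rules:
  - pattern: org.apache.cassandra.metrics<type=(\w+), scope=(\w+), name=(\w+)><>Value
    name: cassandra_$1_$2_$3
```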

Also, for datadog agents, some work was done in the latest version so
metric collection requires less CPU. Maybe there is a similar update that
was released or that you could ask for. Also, CPU might be related to GC: as
the agent is running inside a JVM, some tuning might help.

If you really cannot do much to improve it on your side, I would open an
issue or a discussion on prometheus side (
https://github.com/prometheus/jmx_exporter/issues maybe?).

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



2018-07-05 20:39 GMT+01:00 rajpal reddy :

> We are seeing the CPU spike only when Jmx metrics are exposed using
> Jmx_exporter. Tried setting up JMX authentication, still see the CPU spike. If
> i stop using jmx exporter  we don’t see any cpu spike. is there any thing
> we have to tune to make work with Jmx_exporter?
>
>
> On Jun 14, 2018, at 2:18 PM, rajpal reddy 
> wrote:
>
> Hey Chris,
>
> Sorry to bother you. Did you get a chance to look at the gclog file I sent
> last night.
>
> On Wed, Jun 13, 2018, 8:44 PM rajpal reddy 
> wrote:
>
>> Chris,
>>
>> sorry attached wrong log file. attaching gc collection seconds and cpu.
>> there were going high at the same time and also attached the gc.log.
>> grafana dashboard and gc.log timing are 4hours apart gc can be see 06/12th
>> around 22:50
>>
>> rate(jvm_gc_collection_seconds_sum{"}[5m])
>>
>> > On Jun 13, 2018, at 5:26 PM, Chris Lohfink  wrote:
>> >
>> > There are not even a 100ms GC pause in that, are you certain theres a
>> problem?
>> >
>> >> On Jun 13, 2018, at 3:00 PM, rajpal reddy 
>> wrote:
>> >>
>> >> Thanks Chris I did attached the gc logs already. reattaching them
>> now.
>> >>
>> >> it started yesterday around 11:54PM
>> >>> On Jun 13, 2018, at 3:56 PM, Chris Lohfink 
>> wrote:
>> >>>
>> >>>> What is the criteria for picking up the value for G1ReservePercent?
>> >>>
>> >>>
>> >>> it depends on the object allocation rate vs the size of the heap.
>> Cassandra ideally would be sub 500-600mb/s allocations but it can spike
>> pretty high with something like reading a wide partition or repair
>> streaming which might exceed what the g1 ygcs tenuring and timing is
>> prepared for from previous steady rate. Giving it a bigger buffer is a nice
>> safety net for allocation spikes.
>> >>>
>> >>>> is the HEAP_NEWSIZE is required only for CMS
>> >>>
>> >>>
>> >>> it should only set Xmn with that if using CMS, with G1 it should be
>> ignored or else yes it would be bad to set Xmn. Giving the gc logs will
>> give the results of all the bash scripts along with details of whats
>> happening so its your best option if you want help to share that.
>> >>>
>> >>> Chris
>> >>>
>> >>>> On Jun 13, 2018, at 12:17 PM, Subroto Barua <
>> sbarua...@yahoo.com.INVALID > wrote:
>> >>>>
>> >>>> Chris,
>> >>>> What is the criteria for picking up the value for G1ReservePercent?
>> >>>>
>> >>>> Subroto
>> >>>>
>> >>>>> On Jun 13, 2018, at 6:52 AM, Chris Lohfink 
>> wrote:
>> >>>>>
>> >>>>> G1ReservePercent
>> >>>>
>> >>>> 
>>
>>
>


Re: Write Time of a Row in Multi DC Cassandra Cluster

2018-07-10 Thread Alain RODRIGUEZ
Hello,

 I have multi DC (3 DC's) Cassandra cluster/ring - One of the application
> wrote a row to DC1(using Local Quorum)  and within span of 50 ms, it tried
> to read same row from DC2 and could not find the row.

 [...]

So how to determine when the row is actually written in each DC?


To me, the guarantee you are trying to achieve could be obtained using 'EACH_QUORUM'
for writes (ie 'LOCAL_QUORUM' in each DC), and 'LOCAL_QUORUM' for reads, for
example. You would then have strong consistency, as long as the same
client application runs the write and then the read, or triggers the second
call sequentially, after validating the write, in some way.

Our both DC's have sub milli second latency at network level, usually <2
> ms. We promised 20 ms consistency. In this case Application could not find
> the row in DC2 in 50 ms
>

In these conditions, using 'EACH_QUORUM' might not be too much of a burden
for the coordinator and the client. The writes are already being processed,
this would increase the latency at the coordinator level (and thus at the
client level), but you would be sure that all the clusters have the row in
a majority of the replicas before triggering the read.

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2018-07-10 8:24 GMT+01:00 Simon Fontana Oscarsson <
simon.fontana.oscars...@ericsson.com>:

> Have you tried trace?
> --
> SIMON FONTANA OSCARSSON
> Software Developer
>
> Ericsson
> Ölandsgatan 1
> 37133 Karlskrona, Sweden
> simon.fontana.oscars...@ericsson.com
> www.ericsson.com
>
> On mån, 2018-07-09 at 19:30 +, Saladi Naidu wrote:
> > Cassandra is an eventual consistent DB, how to find when a row is
> actually written in multi DC environment? Here is the problem I am trying
> to solve
> >
> > - I have multi DC (3 DC's) Cassandra cluster/ring - One of the
> application wrote a row to DC1(using Local Quorum)  and within span of 50
> ms, it tried to read same row from DC2 and could not find the
> > row. Our both DC's have sub milli second latency at network level,
> usually <2 ms. We promised 20 ms consistency. In this case Application
> could not find the row in DC2 in 50 ms
> >
> > I tried to use "select WRITETIME(authorizations_json) from
> token_authorizations where " to find  when the Row is written in each
> DC, but both DC's returned same Timestamp. After further research
> > I found that Client V3 onwards Timestamp is supplied at Client level so
> WRITETIME does not help "https://docs.datastax.com/en/
> developer/java-driver/3.4/manual/query_timestamps/"
> >
> > So how to determine when the row is actually written in each DC?
> >
> >
> > Naidu Saladi
>


Re: Paging in Cassandra

2018-07-10 Thread Alain RODRIGUEZ
Hello,

It sounds like a client/coding issue. People are working with distinct
clients to connect to Cassandra. And it looks like there are not many
'spring-data-cassandra' users around ¯\_(ツ)_/¯.

You could give it a try there and see if you have more luck:
https://spring.io/questions.

C*heers,

Alain

2018-07-05 6:21 GMT+01:00 Ghazi Naceur :

> Hello Eveyone,
>
> I'm facing a problem with CassandraPageRequest and Slice.
> In fact, I'm always obtaining the same Slice and I'm not able to get the
> next slice (or Page) of data.
> I'm based on this example :
>
> Link : https://github.com/spring-projects/spring-data-cassandra/pull/114
>
>
> Query query = 
> Query.empty().pageRequest(CassandraPageRequest.first(10));Slice slice = 
> template.slice(query, User.class);
> do {
> // consume slice
> if (slice.hasNext()) {
> slice = template.select(query, slice.nextPageable(), User.class);
> } else {break;
> }
> } while (!slice.getContent().isEmpty());
>
>
>
> I appreciate your help.
>


Re: Debugging high coordinator latencies

2018-07-04 Thread Alain RODRIGUEZ
Hello,

If your problem is in the read path, there are things you can check to see
what's wrong along the way:

- 'nodetool tablestats' (look at the more important tables first - biggest
volume/throughput). A lot of information at the table level, very useful to
troubleshoot. If you have any question with the interpretation of the
output, just let us know :).
- 'nodetool tablehistograms' will give you information on the local reads.
Ideally latencies are low, around the millisecond (or maybe a few ms,
depending on disk speed, caches...); this should allow understanding whether
the local reads are performant or not, as the issue might not be related to
'coordination'. Also see the number of sstables hit per read. This should
be as low as possible, as each hit to another SSTable may touch the
disk, which is the slowest part of our infrastructures. Compaction strategy
and tuning can help here.
- 'nodetool tpstats' - This shows the thread pool stats. Here we are
especially interested in the pending/blocked/dropped tasks. It's sometimes
enlightening to use the following command during a peak in traffic when the
pressure is high: 'watch -d nodetool tpstats'
- 'nodetool compactionstats -H' - Make sure compactions are keeping up, in
particular if reads are hitting a lot of sstables.
- You can trace some queries using cqlsh for example. See what is slow - if
you manage to find out which query is slow.
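
For instance, in cqlsh (the query itself is a placeholder):

```
TRACING ON;
SELECT * FROM my_ks.my_table WHERE id = 42;  -- the full trace is printed after the rows
TRACING OFF;

-- or trace a small sample of live traffic and read system_traces later:
-- nodetool settraceprobability 0.001
```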

Beyond that, debugging usually takes a heap dump and inspection with
> yourkit or MAT or similar
>

Yes. About similar tools, you can also give a try to
https://github.com/aragozin/jvm-tools

With commands like the following, you could start understanding what is
happening in the heap.

java -jar sjk-0.10.1.jar ttop -p <pid> -n 20 -o CPU
# On my mac/ccm test cluster I ran something like this:
java -jar sjk-0.10.1.jar ttop -p $(ps u | grep cassandra | grep -v
grep | awk '{print $2}' | head -n 1) -n 25 -o CPU


Anything else I can do to conclude whether this is GC related or not ?
>

In most cases, it is possible to keep stop-the-world GC pauses below 3 - 5%
of the time (ie. the node is unavailable 3 to 5% of the time doing GC),
which leaves 95+% of the time for the user application, Apache
Cassandra in our case. If you want to share the gc.logs during a spike in
latencies, we could probably let you know how GC is performing.

What hardware are you using?

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-07-01 8:26 GMT+01:00 Tunay Gür :

> Thanks for the recommendation Jeff, I'll try to get a heap dump next time
> this happens and try the other changes in the mean time.
>
> Also not sure but this CASSANDRA-13900 looked it might be related.
>
> On Sat, Jun 30, 2018 at 9:51 PM, Jeff Jirsa  wrote:
>
>> The young gcs loom suspicious
>>
>> Without seeing the heap it’s hard to be sure, but be sure you’re
>> adjusting your memtable (size and flush threshold), and you may find moving
>> it offheap helps too
>>
>> Beyond that, debugging usually takes a heap dump and inspection with
>> yourkit or MAT or similar
>>
>> 3.0.14.10 reads like a Datastax version - I know there’s a few reports of
>> recyclers not working great in 3.11.x but haven’t seen many heap related
>> leak concerns with 3.0.14
>>
>> --
>> Jeff Jirsa
>>
>>
>> On Jun 30, 2018, at 5:49 PM, Tunay Gür  wrote:
>>
>> Dear Cassandra users,
>>
>> I'm observing high coordinator latencies (spikes going over 1sec for P99)
>> without corresponding keyspace read latencies. After researching this list
>> and public web, I focused my investigation around GC, but still couldn't
>> convince myself %100 (mainly because my lack of experience in JVM GC and
>> Cassandra behavior). I'd appreciate if you can help me out.
>>
>> *Setup:*
>> - 2DC 40 nodes each
>> - Cassandra Version: 3.0.14.10
>> - G1GC
>> - -Xms30500M -Xmx30500M
>> - Traffic mix:  20K continuous RPS  + 10K continuous WPS + 40K WPS daily
>> bulk ingestion (for 2 hours)
>> - Row cache disabled, Keycache 5GB capacity
>>
>> *Some observations:*
>> - I don't have clear repro steps, but I feel like high coordinator
>> latencies gets triggered by some sudden change in traffic (i.e bulk
>> ingestion or DC failover). For example last time it happened, bulk
>> ingestion triggered it and coordinator latencies keep spiraling up until I
>> drain some of the traffic:
>>
>> 
>> ​
>> - I see corresponding increase in GC warning logs that looks similar to
>> this:
>>
>> G1 Young Generation GC in 3543ms. G1 Eden Space: 1535115264 -> 0; G1 Old
>> Gen: 14851011568 -> 1458593

Re: C* in multiple AWS AZ's

2018-07-03 Thread Alain RODRIGUEZ
Hi,

Nice that you solved the issue. I had some thoughts while reading:


> My original thought was a new DC parallel to the current, and then
> decommission the other DC.
>

I also think this is the best way to go when possible. It can be reverted
at any time in the process, it respects data distribution, you can switch app by
app, and it has some other advantages.

Also, I am a bit late on this topic, but using this:

AP-SYDNEY, and US-EAST.. I'm using Ec2Snitch over a site-to-site tunnel..
> I'm wanting to move the current US-EAST from AZ 1a to 1e.


You could have kept Ec2Snitch and used the 'dc_suffix' option in the
'cassandra-rackdc.properties'
file to create a non-conflicting logical datacenter in the same region. For
example, with 'dc_suffix' set to '-WHATEVER' the new data center created
would have been called 'US-EAST-WHATEVER' :-).
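
Something like this in 'cassandra-rackdc.properties' on the new nodes (the
suffix itself is whatever you pick):

```
# cassandra-rackdc.properties, with Ec2Snitch
dc_suffix=-WHATEVER
```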

I know all docs say use ec2multiregion for multi-DC.
>

Using Ec2MultiRegionSnitch brought nothing but trouble to me in the past. You
have to open security groups for both public and private IPs, the number of
rules is limited (or was for me when I did it) and after some headaches, we
did the same: a tunnel between VPCs, and used Ec2Snitch.

Yet, I would now keep GPFS if that works fully for you. It will even
allow you to run a hybrid cluster with non-AWS hardware or transition out of
AWS if you need to at some point in the future.

As a side note, using i3, you might have to use a recent operating system
(such as Ubuntu 16.04) to have the latest drivers for NVMe. NVMe support in
Ubuntu 14.04 AMI is not reliable. It might be absent or lead to data
corruption. Make sure the OS in use works well with this hardware.

https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ssd-instance-store.html

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-06-29 16:35 GMT+01:00 Pradeep Chhetri :

> Ohh i see now. It makes sense. Thanks a lot.
>
> On Fri, Jun 29, 2018 at 9:17 PM, Randy Lynn  wrote:
>
>> data is only lost if you stop the node. between restarts the storage is
>> fine.
>>
>> On Fri, Jun 29, 2018 at 10:39 AM, Pradeep Chhetri 
>> wrote:
>>
>>> Isnt NVMe storage an instance storage ie. the data will be lost in case
>>> the instance restarts. How are you going to make sure that there is no data
>>> loss in case instance gets rebooted?
>>>
>>> On Fri, 29 Jun 2018 at 7:00 PM, Randy Lynn  wrote:
>>>
>>>> GPFS - Rahul FTW! Thank you for your help!
>>>>
>>>> Yes, Pradeep - migrating to i3 from r3. moving for NVMe storage, I did
>>>> not have the benefit of doing benchmarks.. but we're moving from 1,500 IOPS
>>>> so I intrinsically know we'll get better throughput.
>>>>
>>>> On Fri, Jun 29, 2018 at 7:21 AM, Rahul Singh <
>>>> rahul.xavier.si...@gmail.com> wrote:
>>>>
>>>>> Totally agree. GPFS for the win. EC2 multi region snitch is an
>>>>> automation tool like Ansible or Puppet. Unless you have two orders of
>>>>> magnitude more servers than you do now, you don’t need it.
>>>>>
>>>>> Rahul
>>>>> On Jun 29, 2018, 6:18 AM -0400, kurt greaves ,
>>>>> wrote:
>>>>>
>>>>> Yes. You would just end up with a rack named differently to the AZ.
>>>>> This is not a problem as racks are just logical. I would recommend
>>>>> migrating all your DCs to GPFS though for consistency.
>>>>>
>>>>> On Fri., 29 Jun. 2018, 09:04 Randy Lynn,  wrote:
>>>>>
>>>>>> So we have two data centers already running..
>>>>>>
>>>>>> AP-SYDNEY, and US-EAST.. I'm using Ec2Snitch over a site-to-site
>>>>>> tunnel.. I'm wanting to move the current US-EAST from AZ 1a to 1e..
>>>>>> I know all docs say use ec2multiregion for multi-DC.
>>>>>>
>>>>>> I like the GPFS idea. would that work with the multi-DC too?
>>>>>> What's the downside? status would report rack of 1a, even though in
>>>>>> 1e?
>>>>>>
>>>>>> Thanks in advance for the help/thoughts!!
>>>>>>
>>>>>>
>>>>>> On Thu, Jun 28, 2018 at 6:20 PM, kurt greaves 
>>>>>> wrote:
>>>>>>
>>>>>>> There is a need for a repair with both DCs as rebuild will not
>>>>>>> stream all replicas, so unless you can guarantee you were perfectly
>>>>>>> consistent at time of rebuild you'll want to do a re

Re: JVM Heap erratic

2018-07-03 Thread Alain RODRIGUEZ
 the past really.
There is a lot of details in this log file about where is the biggest
pressure, the allocation rate, the GC duration distribution for each type
of GC, etc. With this, I could see where the pressure is and suggest how to
work on it.
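
If GC logging is not enabled yet, the usual Java 8 flags look like this (a
sketch; the jvm.options / cassandra-env.sh shipped with your Cassandra
version already contains them, possibly commented out):

```
# jvm.options (Java 8) - GC logging
-Xloggc:/var/log/cassandra/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=10
-XX:GCLogFileSize=10M
```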

Be aware that extra GC is also sometimes the consequence (and not the cause)
of an issue. Due to pending requests, wide partitions, ongoing compactions,
repairs or an intensive workload, GC pressure can increase and mask
another underlying root issue. You might want to check that the
cluster is healthy beyond GC, as a lot of distinct internal parts of
Cassandra have an impact on the GC.

Hope that helps,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2018-06-28 23:19 GMT+01:00 Elliott Sims :

> Odd.  Your "post-GC" heap level seems a lot lower than your max, which
> implies that you should be OK with ~10GB.  I'm guessing either you're
> genuinely getting a huge surge in needed heap and running out, or it's
> falling behind and garbage is building up.  If the latter, there might be
> some tweaking you can do.  Probably worth turning on GC logging and digging
> through exactly what's happening.
>
> CMS is kind of hard to tune and can have problems with heap fragmentation
> since it doesn't compact, but if it's working for you I'd say stick with it.
>
> On Thu, Jun 28, 2018 at 3:14 PM, Randy Lynn  wrote:
>
>> Thanks for the feedback..
>>
>> Getting tons of OOM lately..
>>
>> You mentioned overprovisioned heap size... well...
>> tried 8GB = OOM
>> tried 12GB = OOM
>> tried 20GB w/ G1 = OOM (and long GC pauses usually over 2 secs)
>> tried 20GB w/ CMS = running
>>
>> we're java 8 update 151.
>> 3.11.1.
>>
>> We've got one table that's got a 400MB partition.. that's the max.. the
>> 99th is < 100MB, and 95th < 30MB..
>> So I'm not sure that I'm overprovisioned, I'm just not quite yet to the
>> heap size based on our partition sizes.
>> All queries use cluster key, so I'm not accidentally reading a whole
>> partition.
>> The last place I'm looking - which maybe should be the first - is
>> tombstones.
>>
>> sorry for the afternoon rant! thanks for your eyes!
>>
>> On Thu, Jun 28, 2018 at 5:54 PM, Elliott Sims 
>> wrote:
>>
>>> It depends a bit on which collector you're using, but fairly normal.
>>> Heap grows for a while, then the JVM decides via a variety of metrics that
>>> it's time to run a collection.  G1GC is usually a bit steadier and less
>>> sawtooth than the Parallel Mark Sweep , but if your heap's a lot bigger
>>> than needed I could see it producing that pattern.
>>>
>>> On Thu, Jun 28, 2018 at 9:23 AM, Randy Lynn  wrote:
>>>
>>>> I have datadog monitoring JVM heap.
>>>>
>>>> Running 3.11.1.
>>>> 20GB heap
>>>> G1 for GC.. all the G1GC settings are out-of-the-box
>>>>
>>>> Does this look normal?
>>>>
>>>> https://drive.google.com/file/d/1hLMbG53DWv5zNKSY88BmI3Wd0ic
>>>> _KQ07/view?usp=sharing
>>>>
>>>> I'm a C# .NET guy, so I have no idea if this is normal Java behavior.
>>>>
>>>>
>>>>
>>>> --
>>>> Randy Lynn
>>>> rl...@getavail.com
>>>>
>>>> office:
>>>> 859.963.1616 <+1-859-963-1616> ext 202
>>>> 163 East Main Street - Lexington, KY 40507 - USA
>>>> <https://maps.google.com/?q=163+East+Main+Street+-+Lexington,+KY+40507+-+USA=gmail=g>
>>>>
>>>> <https://www.getavail.com/> getavail.com <https://www.getavail.com/>
>>>>
>>>
>>>
>>
>>
>> --
>> Randy Lynn
>> rl...@getavail.com
>>
>> office:
>> 859.963.1616 <+1-859-963-1616> ext 202
>> 163 East Main Street - Lexington, KY 40507 - USA
>> <https://maps.google.com/?q=163+East+Main+Street+-+Lexington,+KY+40507+-+USA=gmail=g>
>>
>> <https://www.getavail.com/> getavail.com <https://www.getavail.com/>
>>
>
>


Re: how to immediately delete tombstones

2018-06-04 Thread Alain RODRIGUEZ
Hello,

When you don't have any disk space anymore (or not much), there are things
you can do :



*Make some space:*
- Remove snapshots (nodetool clearsnapshot)
- Remove any heap dump that might be stored there
- Remove *old* -tmp- SSTables that could still be around.
- Truncate unused tables / data: truncate does not create tombstones, but
removes the files, after creating a snapshot (default behavior). Thus
truncating/dropping a table and removing the snapshot can free up
disk space immediately.
- Anything other than SSTables on this disk that can be removed?

OR

*Add some space:*

- Add some disk space temporary (EBS, physical disk)
- Make sure to clean tombstones ('unchecked_tombstone_compaction: true' often
helped)
- Wait for tombstones to be compacted with all the data it shadows and disk
space to reduce
- Remove extra disk

OR

*Play around, at the edge*:

- Look at the biggest SSTables that you can actually compact (be aware of
the compressed vs uncompressed sizes when monitoring - I think the thing
was that 'nodetool compactionstats -H' shows the values uncompressed)
- Use sstablemetadata to determine the ratio of the data that is droppable
- Run user defined compaction on these sstables specifically
- If it works and there is more disk space available, reproduce with bigger
sstables.
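
A sketch of those last steps (paths, keyspace/table and SSTable names are
placeholders; the JMX call is the one Nitan mentioned):

```
# ratio of droppable tombstones in one SSTable
sstablemetadata /var/lib/cassandra/data/my_ks/my_table-<id>/mc-1234-big-Data.db | grep -i droppable

# user defined compaction on that single SSTable, through JMX (jmxterm)
echo "run -b org.apache.cassandra.db:type=CompactionManager forceUserDefinedCompaction /var/lib/cassandra/data/my_ks/my_table-<id>/mc-1234-big-Data.db" | java -jar jmxterm-1.0.0-uber.jar -l 127.0.0.1:7199
```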

In some complex cases removing commit logs helped as well, but this is
riskier already as it would be playing with consistency/durability
considerations.

I'm using cassandra on a single node.
>

I would not play with commit logs with a single-node setup. But I imagine
it is not a production 'cluster' either.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-06-02 8:29 GMT+01:00 Nitan Kainth :

> You can compact selective sstables using jmx Call.
>
> Sent from my iPhone
>
> On Jun 2, 2018, at 12:04 AM, onmstester onmstester 
> wrote:
>
> Thanks for your replies
> But my current situation is that i do not have enough free disk for my
> biggest sstable, so i could not run major compaction or nodetool
> garbagecollect
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>  On Thu, 31 May 2018 22:32:32 +0430 *Alain RODRIGUEZ
> >* wrote 
>
>
>


Re: how to immediately delete tombstones

2018-05-31 Thread Alain RODRIGUEZ
Hello,

It's a very common but somewhat complex topic. We wrote about it 2 years
ago and I really think this post might have answers you are looking for:
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html

Something that you could try (if you do care about ending up with one big sstable
:)) is to enable 'unchecked_tombstone_compaction'. This removes a pre-check
supposed to only trigger tombstone compactions when it is worth it (ie
there are no sstable overlaps) but that over time proved to be inefficient
from a tombstone eviction perspective.

If data can actually be deleted (no overlaps, gc_grace_seconds lowered or
reached, ...) then changing option 'unchecked_tombstone_compaction' to
'true' might do a lot of good in terms of disk space. Be aware that a bunch
of compactions might be triggered and that disk space will start by
increasing before (possibly) reducing after the compactions.
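
It is a sub-option of the compaction strategy, so it can be changed with an
ALTER TABLE. A sketch (names are placeholders, and keep your existing
compaction options since the whole map is replaced):

```
ALTER TABLE my_ks.my_table WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'unchecked_tombstone_compaction': 'true',
  'tombstone_threshold': '0.2'
};
```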

The gc_grace of table was default (10 days), now i set that to 0, although
> many compactions finished but no space reclaimed so far.
>

Be aware that if some deletes did not reach all the replicas, the data will
eventually come back: by lowering gc_grace_seconds, you don't allow
repairs to process the data before tombstones are actually evicted.

Also, by setting gc_grace_seconds to 0, you also disabled the hints
altogether. gc_grace_seconds should always be at least as large as
'max_hint_window_in_ms'. My colleague Radovan wrote a post with
more information on this:
http://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html
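
For the record, with the defaults this would be (the table name is a
placeholder):

```
# cassandra.yaml default hint window: 3 hours
max_hint_window_in_ms: 10800000

-- so keep gc_grace_seconds at 10800 or more (CQL)
ALTER TABLE my_ks.my_table WITH gc_grace_seconds = 10800;
```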

Good luck with your tombstones, again, those are a bit tricky to handle
sometimes ;-)

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-05-31 16:19 GMT+01:00 Nicolas Guyomar :

> Hi,
>
> You need to manually force compaction if you do not care ending up with
> one big sstable (nodetool compact)
>
> On 31 May 2018 at 11:07, onmstester onmstester 
> wrote:
>
>> Hi,
>> I've deleted 50% of my data row by row now disk usage of cassandra data
>> is more than 80%.
>> The gc_grace of table was default (10 days), now i set that to 0,
>> although many compactions finished but no space reclaimed so far.
>> How could i force deletion of tombstones in sstables and reclaim the disk
>> used by deleted rows?
>> I'm using cassandra on a single node.
>>
>> Sent using Zoho Mail <https://www.zoho.com/mail/>
>>
>>
>>
>


Re: cassandra concurrent read performance problem

2018-05-28 Thread Alain RODRIGUEZ
Hi,

Would you share some more context with us?

- What Cassandra version do you use?
- What is the data size per node?
- How much RAM does the hardware have?
- Does your client use paging?

A few ideas to explore:

- Try tracing the query, see what's taking time (and resources)
- From the tracing, logs, sstablemetadata tool or monitoring dashboard, do
you see any tombstone?
- What is the percentage of GC pause per second? 128 GB seems huge to me,
even with G1GC. Do you still have memory for page caching? Also from
general logs, gc logs or dashboard. Reallocating 70GB every minute does not
seem right. Maybe using a smaller size for the heap (more common) would
have more frequent but smaller pauses?
- Any pending/blocked threads? (monitoring charts about thread pools or
'nodetool tpstats'. Also 'watch -d "nodetool tpstats"' will make the
evolution and newly pending/blocked threads obvious to you; a Cassandra
restart resets those stats as well).
- What is the number of SSTable touched per read operations on the main
tables?
- Are the bloom filters efficient?
- Is the key cache efficient (hit ratio of 0.8, 0.9+)?
- The logs should be reporting something during the 10 minutes the machines
were unresponsive, give a try to: grep -e "WARN" -e "ERROR"
/var/log/cassandra/system.log

More than 200 MB per partition is quite big. Explore what can be improved
operationally, but you might ultimately have to reduce the partition size.
On the other side, Cassandra tends to evolve towards allowing bigger
partition sizes, as it handles them with better efficiency over time. If you
can work around this on the operational side, you might be able to keep this
model.

If it is possible to experiment on a canary node and observe, I would
probably go this path after identifying a possible origin and solution for
this issue.

Other tips that might help here:
- Disabling 'dynamic snitching' proved to improve performance (often
clearly visible looking at p99), mostly because of a better usage of the
(disk) page cache.
- Making sure that most of your partitions fit within the read block size
(buffer) you are using can also make reads more efficient (when data is
compressed, the chunk size determines the buffer size).
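
A minimal example, assuming Cassandra 3.x and placeholder names (smaller
chunks make point reads cheaper at the cost of a slightly lower compression
ratio and a bit more compression metadata):

ALTER TABLE my_keyspace.my_table
  WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 16};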

I hope this helps. I am curious about this one, please let us know what
you find out :).

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-05-26 14:21 GMT+01:00 onmstester onmstester <onmstes...@zoho.com>:

> By reading 90 partitions concurrently(each having size > 200 MB), My
> single node Apache Cassandra became unresponsive,
> no read and write works for almost 10 minutes.
> I'm using this configs:
> memtable_allocation_type: offheap_buffers
> gc: G1GC
> heap: 128GB
> concurrent_reads: 128 (having more than 12 disk)
>
> There is not much pressure on my resources except for the memory that the
> eden with 70GB is filled and reallocated in less than a minute.
> Cpu is about 20% while read is crashed and iostat shows no significant
> load on disk.
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>


Re: Log application Queries

2018-05-25 Thread Alain RODRIGUEZ
Nate wrote a post about this exact topic. In case it is of some use:
http://thelastpickle.com/blog/2016/02/10/locking-down-apache-cassandra-logging.html
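
As a quick illustration of the probabilistic tracing approach discussed
below (keep the probability low in production - traces are written to the
system_traces keyspace and add load; the query is just an example):

nodetool settraceprobability 0.001
cqlsh -e "SELECT session_id, started_at, parameters FROM system_traces.sessions LIMIT 10;"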

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-05-25 18:00 GMT+01:00 Nitan Kainth <nitankai...@gmail.com>:

> So settraceprobability is better option than nodetool :)
>
>
> Regards,
> Nitan K.
> Cassandra and Oracle Architect/SME
> Datastax Certified Cassandra expert
> Oracle 10g Certified
>
> On Fri, May 25, 2018 at 12:15 PM, Surbhi Gupta <surbhi.gupt...@gmail.com>
> wrote:
>
>> nodetool setlogginglevel is only valid for below:
>>
>>
>>- org.apache.cassandra
>>- org.apache.cassandra.db
>>- org.apache.cassandra.service.StorageProxy
>>
>>
>> On 25 May 2018 at 09:01, Nitan Kainth <nitankai...@gmail.com> wrote:
>>
>>> Thanks Surbhi. I found another way. I used nodetool settraceprobability
>>> 1 and it is logging in system_traces.
>>>
>>> How is it different from nodetool setlogginglevel?
>>>
>>>
>>> Regards,
>>> Nitan K.
>>> Cassandra and Oracle Architect/SME
>>> Datastax Certified Cassandra expert
>>> Oracle 10g Certified
>>>
>>> On Fri, May 25, 2018 at 11:41 AM, Surbhi Gupta <surbhi.gupt...@gmail.com
>>> > wrote:
>>>
>>>> If using dse then u can enable in dse.yaml.
>>>>
>>>> # CQL slow log settings
>>>> cql_slow_log_options:
>>>> enabled: true
>>>> threshold_ms: 0
>>>> ttl_seconds: 259200
>>>>
>>>> As far as my understanding says setlogginglevel  is used for changing
>>>> the logging level as below but not for slow query .
>>>>
>>>>- ALL
>>>>- TRACE
>>>>- DEBUG
>>>>- INFO
>>>>- WARN
>>>>- ERROR
>>>>- OFF
>>>>
>>>>
>>>>
>>>> On 25 May 2018 at 08:24, Nitan Kainth <nitankai...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I would like to log all C* queries hitting cluster. Could someone
>>>>> please tell me how can I do it at cluster level?
>>>>> Will nodetool setlogginglevel work? If so, please share example with
>>>>> library name.
>>>>>
>>>>> C* version 3.11
>>>>>
>>>>
>>>>
>>>
>>
>


Re: Cassandra Monitoring tool

2018-05-25 Thread Alain RODRIGUEZ
Hello,

With more details on your criteria, it would be easier.

I used:

- SPM from Sematext (Commercial - dashboards out of the box)
- Datastax OpsCenter (Commercial, With DSE only nowadays)
- Grafana/Graphite or Prometheus (Open source - 'do it yourself' - but some
templates exist)
- Datadog (Commercial - dashboards out of the box)

It depends what you want really. All those worked for me at some point in
time :).

Note: We built the dashboards that ship out of the box for Datadog. I
believe they should put anyone starting with Cassandra in good conditions to
operate and troubleshoot Apache Cassandra. My opinion here is most
definitely biased :). I liked the interfaces of Datadog (using it regularly)
and Sematext (old experience) the most.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-05-25 17:23 GMT+01:00 ANEESH KUMAR K.M <anee...@gmail.com>:

> Please suggest me some good cluster monitoring tool for cassandra multi
> region cluster.
>
>


Re: unsubscribe

2018-05-25 Thread Alain RODRIGUEZ
Hello Matthias,

I don't think you really left :). Give this address a try instead:
user-unsubscr...@cassandra.apache.org

;-)

C*iao,

2018-05-24 23:37 GMT+01:00 Matthias Hübner :

> Ciao
>


Re: Reading Data from C* Cluster

2018-05-25 Thread Alain RODRIGUEZ
Hello,

Where I can have configurable codes in partition key as cassandra supports.
>

I am sorry I don't understand this.

1) How much data I can put in one partition ?
>

Well, Cassandra supports huge partitions. In practice, the value of 100 MB
per partition is often shared; I consider it a good 'soft' limit.
Basically, with such a small table/strings combination, you will probably
not reach the limits any time soon unless the workload is really intensive.
The hard limit is kind of irrelevant I would say: Cassandra will be working
poorly well before reaching it, so the precise value does not matter much.

Nonetheless, make sure to pick a good partition key. A good partition key
should ideally be meaningful and allow you to retrieve all the data you
need by querying one partition (or just a few). A good partition key should
also allow the partitions/rows to all have a similar size (as much as
possible) and to be distributed evenly in the token ring as well (using a
date as the partition key, for example, will make all the requests for this
date hit only a small portion of the servers - 1 server and its replicas
only). All the fragments of the same partition go to 1 node (and to the
corresponding replicas). It is easy to create imbalances.

It is really important to pick this key appropriately considering these
best practices/constraints.
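
To make this a bit more concrete - a sketch only, not a recommendation for
your exact model - a common pattern to keep partitions bounded and well
distributed is to add a bucket to the partition key:

CREATE TABLE codes_by_campaign (
    campaign_id text,
    bucket int,   -- e.g. hash(code) % 16, computed client side (illustrative assumption)
    code text,
    PRIMARY KEY ((campaign_id, bucket), code)
);

Reading all the codes of a campaign then means querying the few, known
buckets, which keeps any single partition from growing unbounded.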

2) I want to read data from cassandra as
> SELECT * from code where campaign_id_sub_partition = 'XXX-1';
> Is reading whole partition at one time is feasible or shall I use
> pagination ?
>

I would make sure to use a client that handles pagination under the hood.
Modern clients do it for you. Depending on the size of the partition, the
pagination will be useless, nice to have, or mandatory to avoid hitting the
timeout. But I would really try not to implement this myself and pick a
client that does it already.
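
As an example, from cqlsh (which pages transparently) something like the
following works; drivers expose an equivalent fetch size setting (the
keyspace name is a placeholder):

PAGING 500;
SELECT * FROM my_keyspace.code WHERE campaign_id_sub_partition = 'XXX-1';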

C*heers,
-------
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2018-05-24 19:55 GMT+01:00 raman gugnani <ramangugnani@gmail.com>:

> HI
>
> I want to read data from C* Cluster.
>
> Schema is
>
> CREATE TABLE code(
> campaign_id_sub_partition text,
> code text,
> PRIMARY KEY ((campaign_id_sub_partition), code)
> );
>
> Where I can have configurable codes in partition key as cassandra supports.
> campaign_id_sub_partition is appx 10 characters
> code field is appx 12~20 characters
>
>
> Query :
> 1) How much data I can put in one partition ?
>
> 2) I want to read data from cassandra as
> SELECT * from code where campaign_id_sub_partition = 'XXX-1';
> Is reading whole partition at one time is feasible or shall I use
> pagination ?
>
> --
> Raman Gugnani
>
> 8588892293
>
>


Re: repair in C* 3.11.2 and anticompactions

2018-05-24 Thread Alain RODRIGUEZ
Hi Jean,

Here is what Alexander wrote about it, a few months ago, in the comments of
the article mentioned above:

"A full repair is an incremental one that doesn't skip repaired data.
> Performing anticompaction in that case too (regardless it is a valid
> approach or not) allows to mark as repaired SSTables that weren't before
> the full repair was started.
>
> It was clearly added with the intent of making full repair part of a
> routine where incremental repairs are also executed, leaving only subrange
> for people who do not want to use incremental.
>
> One major drawback is that by doing so, the project increased the
> operational complexity of running full repairs as it does not allow
> repairing the same keyspace from 2 nodes concurrently without risking some
> failures during validation compaction (when an SSTable is being
> anticompacted, it cannot go through validation compaction)."
>
>
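
For completeness, the subrange flavour mentioned above (which does not
trigger anticompaction) is run by passing explicit token bounds, for
example (the token values below are placeholders - tools like Reaper or
range repair scripts compute them for you):

nodetool repair -full -st -9223372036854775808 -et -3074457345618258603 keyspace1 standard1
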
 I hope this helps,

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-05-23 21:48 GMT+01:00 Lerh Chuan Low <l...@instaclustr.com>:

> Hey Jean,
>
> I think it still does anticompaction by default regardless, it will not do
> so only if you do subrange repair. TLP wrote a pretty good article on that:
> http://thelastpickle.com/blog/2017/12/14/should-
> you-use-incremental-repair.html
>
> On 24 May 2018 at 00:42, Jean Carlo <jean.jeancar...@gmail.com> wrote:
>
>> Hello
>>
>> I just want to understand why, if I run a repair non incremental like this
>>
>> nodetool -h 127.0.0.1 -p 7100 repair -full -pr keyspace1 standard1
>>
>> Cassandra does anticompaction as the logs show
>>
>> INFO  [CompactionExecutor:20] 2018-05-23 16:36:27,598
>> CompactionManager.java:1545 - Anticompacting [BigTableReader(path='/home/jr
>> iveraura/.ccm/test/node1/data0/keyspace1/standard1-36a6ec405
>> e9411e8b1d1b38a73559799/mc-2-big-Data.db')]
>>
>> As far as I understood the anticompactions are used to make the repair
>> incremantals possible, so I was expecting no having anticompactions making
>> repairs with the options  -pr -full
>>
>> Anyone knows why does cassandra make those anticompactions ?
>>
>> Thanks
>>
>> Jean Carlo
>>
>> "The best way to predict the future is to invent it" Alan Kay
>>
>
>


Re: Why nodetool cleanup should be run sequentially after node joined a cluster

2018-04-11 Thread Alain RODRIGUEZ
I confirm what Christophe said.

I always ran them in parallel without any problem, really. Historically
cleanup was using only one compactor, and the impact on my clusters has
always been acceptable.

Nonetheless, newer Cassandra versions allow multiple compactors to work in
parallel during cleanup, and this can be really harmful - or really
efficient if resources are available and it does not impact the read and
write operations. If all the nodes run cleanup in parallel, then limiting
the number of threads used per node is really important.

My colleague Anthony described this option here:
http://thelastpickle.com/blog/2017/08/14/limiting-nodetool-parallel-threads.html
.
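
For example, a sketch only (the -j flag caps the number of compaction
threads cleanup is allowed to use on that node; the keyspace name is a
placeholder):

nodetool cleanup -j 1 my_keyspace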

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2018-04-11 6:04 GMT+01:00 Christophe Schmitz <christo...@instaclustr.com>:

> Hi Mikhail,
>
>
> Nodetool cleanup can add a fair amount of extra load (mostly IO) on your
> Cassandra nodes. Therefore it is recommended to run it during lower cluster
> usage, and one node at a time, in order to limit the impact on your
> cluster. There are no technical limitations that would prevent you to run
> it at the same time. It's just a precaution measure.
>
> Cheers,
> Christophe
>
>
> On 11 April 2018 at 14:49, Mikhail Tsaplin <tsmis...@gmail.com> wrote:
>
>> Hi,
>> In https://docs.datastax.com/en/cassandra/3.0/cassandra/oper
>> ations/opsAddNodeToCluster.html
>> there is recommendation:
>> 6) After all new nodes are running, run nodetool cleanup
>> <https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsCleanup.html>
>>  on each of the previously existing nodes to remove the keys that no
>> longer belong to those nodes. Wait for cleanup to complete on one node
>> before running nodetool cleanup on the next node.
>>
>> I had added a new node to the cluster, and running nodetool cleanup
>> according to this recommendation - but it takes near 10 days to complete on
>> a single node. Is it safe to start it on all nodes?
>>
>
>
>
> --
>
> *Christophe Schmitz - **VP Consulting*
>
> AU: +61 4 03751980 / FR: +33 7 82022899
>
> <https://www.facebook.com/instaclustr>   <https://twitter.com/instaclustr>
><https://www.linkedin.com/company/instaclustr>
>
> Read our latest technical blog posts here
> <https://www.instaclustr.com/blog/>. This email has been sent on behalf
> of Instaclustr Pty. Limited (Australia) and Instaclustr Inc (USA). This
> email and any attachments may contain confidential and legally
> privileged information.  If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>


Re: write latency on single partition table

2018-04-09 Thread Alain RODRIGUEZ
Hi,

Challenging the possibility that the latency is related to the number of
records is a good guess indeed. It might be, but I don't think so, given the
max 50 MB partition size. This should allow fetching a partition of this
size, probably in less than 1 second.

It is possible to trace a query and see how it performs throughout the
distinct internal processes, and find what takes time. There are multiple
ways to do so:

- 'TRACING ON' in cqlsh, then run a problematic query (pay attention to
the consistency level - ONE by default in cqlsh. Use the one your
application, which is facing the latencies, actually uses). A short
example follows this list.
- 'nodetool settraceprobability 0.001' (here, be careful with the
implications of setting this value too high: queries are tracked inside
Cassandra, potentially generating a heavy load).
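
A minimal cqlsh session for the first option could look like this (keyspace,
table and values are placeholders, loosely based on your schema):

CONSISTENCY LOCAL_QUORUM;
TRACING ON;
SELECT * FROM my_keyspace.test WHERE hours = 447000 AND key1 = 42;
TRACING OFF;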


Other interesting global info:

- 'nodetool cfhistograms' (or tablehistograms?) - to have more precise
statistics on percentiles for example
- 'nodetool cfstats' (or tablestats) - detailed information on how the
table/queries are performing on the node
- 'nodetool tpstats' - Thread pool statistics. Look for pending, dropped or
blocked tasks; generally, those are not a good sign :).


If you suspect tombstones, you can use sstablemetadata to check the
tombstone ratio. It can also be related to poor caching, the number of
sstables hit per read on disk or inefficient bloom filters for example.
There are other reasons for slow reads.

When it comes to the read path, multiple parts come into play and the
global result is a bit complex to troubleshoot. Yet try to narrow down
the scope, eliminating possibilities one by one, or directly detect the
issue using tracing to find out where the latency mostly comes from.

If you find something weird but unclear to you, post here again and we will
hopefully be able to help with extra information on the part that is slow :).

C*heers!
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com


2018-04-07 6:05 GMT+01:00 onmstester onmstester <onmstes...@zoho.com>:

> The size is less than 50MB
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>  On Sat, 07 Apr 2018 09:09:41 +0430 *Laxmikant Upadhyay
> <laxmikant@gmail.com <laxmikant@gmail.com>>* wrote 
>
> It seems your partition size is more..what is the size of value field ?
> Try to keep your partition size within 100 mb.
>
> On Sat, Apr 7, 2018, 9:45 AM onmstester onmstester <onmstes...@zoho.com>
> wrote:
>
>
>
> I've defained a table like this
>
> create table test (
> hours int,
> key1 int,
> value1 varchar,
> primary key (hours,key1)
> )
>
> For one hour every input would be written in single partition, because i
> need to group by some 500K records in the partition for a report with
> expected response time in less than 1 seconds so using key1 in partition
> key would made 500K partitions which would be slow on reads.
> Although using  this mechanism gains < 1 seconds response time on reads
> but the write delay increased surprisingly, for this table write latency
> reported by cfstats is more than 100ms but for other tables which accessing
> thousands of partitions while writing in 1 hour , the write delay is
> 0.02ms. But i was expecting that writes to test table be faster than other
> tables because always only one node and one partition would be accessed, so
> no memtable switch happens and all writes would be local to a single node?!
> Should i add another key to my partition key to distribute data on all of
> nodes?
>
> Sent using Zoho Mail <https://www.zoho.com/mail/>
>
>
>


Re: Urgent Problem - Disk full

2018-04-04 Thread Alain RODRIGUEZ
Hi,

When the disks are full, here are the options I can think of depending on
the situation and how 'full' the disk really is:

- Add capacity - add a disk, use JBOD by adding a second data location for
the sstables, move some of them around, then restart Cassandra. Or add a
new node.
- Reduce disk space used. Some options come to my mind to reduce space used:

1 - Clean tombstones *if any* (use sstablemetadata for example to check the
number of tombstones). If you have some not being purged, my first guess
would be to set 'unchecked_tombstone_compaction' to 'true' at the node
level. Yet be aware that this will trigger some compactions that, before
freeing space, start by taking some more temporary disk space!

If the remaining space is really low on one node, you can choose to compact
only the sstables having the highest tombstone ratio - after you made the
change above - and that still fit in the disk space you have left. It can
even be scripted (see the sketch after this list). It worked for me in the
past with disks 100% full. If you do so, you might have to disable/re-enable
automatic compactions at key moments as well.

2 - If you added nodes recently to the data center you can consider
running a 'nodetool cleanup', but here again, it will start by using more
space for temporary sstables, and it might have no positive impact if the
node only owns data for its token ranges.

3 - Another common way to easily reclaim space is to clear snapshots that
are not needed and might have been forgotten or taken by Cassandra:
'nodetool clearsnapshot'. This has no other risk than removing a useful
backup.

4 - Delete data from this table or another table, effectively by directly
removing the sstables - which is possible as you use TWCS - if you don't
need the data anyway.

5 - Truncate one of those other tables we tend to have that are written
'just in case' and actually never used and never read for months. It has
been a powerful way out of this situation for me in the past too :). I
would say: be sure that the disk space is used properly.
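
To illustrate points 1 and 3, a rough sketch (paths, names and thresholds
are placeholders, and the sstablemetadata output format varies a bit between
versions):

# Rank the sstables of one table by estimated droppable tombstone ratio
for f in /var/lib/cassandra/data/my_keyspace/my_table-*/mc-*-big-Data.db; do
    ratio=$(sstablemetadata "$f" | awk '/Estimated droppable tombstones/ {print $NF}')
    echo "$ratio $f"
done | sort -rn | head

# List and remove forgotten snapshots
nodetool listsnapshots
nodetool clearsnapshot -t old_snapshot_name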


There is zero reason to believe a full repair would make this better and a
> lot of reason to believe it’ll make it worse
>

I second that too, just in case. Really, do not run a repair. The only
thing it could do is bring more data to a node that really doesn't need it
right now.

Finally, when this is behind you, disk usage is something you could
consider monitoring, as it is way easier to fix preemptively, while the disk
is not completely full yet. Usually, 20 to 50% of free disk is recommended
depending on your use case.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-04-04 15:34 GMT+01:00 Kenneth Brotman <kenbrot...@yahoo.com.invalid>:

> There's also the old snapshots to remove that could be a significant
> amount of memory.
>
> -Original Message-
> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID]
> Sent: Wednesday, April 04, 2018 7:28 AM
> To: user@cassandra.apache.org
> Subject: RE: Urgent Problem - Disk full
>
> Jeff,
>
> Just wondering: why wouldn't the answer be to:
> 1. move anything you want to archive to colder storage off the
> cluster,
> 2. nodetool cleanup
> 3. snapshot
> 4. use delete command to remove archived data.
>
> Kenneth Brotman
>
> -Original Message-
> From: Jeff Jirsa [mailto:jji...@gmail.com]
> Sent: Wednesday, April 04, 2018 7:10 AM
> To: user@cassandra.apache.org
> Subject: Re: Urgent Problem - Disk full
>
> Yes, this works in TWCS.
>
> Note though that if you have tombstone compaction subproperties set, there
> may be sstables with newer filesystem timestamps that actually hold older
> Cassandra data, in which case sstablemetadata can help finding the sstables
> with truly old timestamps
>
> Also if you’ve expanded the cluster over time and you see an imbalance of
> disk usage on the oldest hosts, “nodetool cleanup” will likely free up some
> of that data
>
>
>
> --
> Jeff Jirsa
>
>
> > On Apr 4, 2018, at 4:32 AM, Jürgen Albersdorfer <
> juergen.albersdor...@zweiradteile.net> wrote:
> >
> > Hi,
> >
> > I have an urgent Problem. - I will run out of disk space in near future.
> > Largest Table is a Time-Series Table with TimeWindowCompactionStrategy
> (TWCS) and default_time_to_live = 0
> > Keyspace Replication Factor RF=3. I run C* Version 3.11.2
> > We have grown the Cluster over time, so SSTable files have different
> Dates on different Nodes.
> >
> > From Application Standpoint it would be safe to loose some of the oldest
> Data.
> >
> > Is it safe to delete some of the oldest SSTable Files, which will no
> longer get touched by

Re: datastax cassandra minimum hardware recommendation

2018-04-04 Thread Alain RODRIGUEZ
Hello.

For questions to Datastax, I recommend you to ask them directly. I often
had a quick answer and they probably can answer this better than we do :).

Apache Cassandra (and probably DSE-Cassandra) can work with 8 CPU (and
less!). I would not go much lower though. I believe the memory amount and
good disk throughputs are more important. It also depends on the workload
type and intensity, encryption, compression etc.

8 CPUs is probably just fine if well tuned, and here in the mailing list,
we 'support' any fancy configuration settings, but with no guarantee on the
response time and without taking the responsibility for your cluster :).

It reminds me of my own start with Apache Cassandra. I started with
t1.micro back then on AWS, and people were still helping me here, of course
after a couple of jokes such as 'you should rather try to play a
PlayStation 4 game on your Gameboy', which is fair enough I guess :). Well,
it was working in prod and I learned how to tune Apache Cassandra; I had no
other option to make this work.

Having more CPU probably improves resiliency to some problems and reduces
the importance of having a cluster perfectly tuned.

Benchmark your workload, test it. This would be the most accurate answer
here given the details we have.

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-04-04 9:44 GMT+01:00 sujeet jog <sujeet@gmail.com>:

> the datastax site has a hardware recommendation of 16CPU / 32G RAM for DSE
> Enterprise,  Any idea what is the minimum hardware recommendation
> supported, can each node be 8CPU and the support covering it ?..
>


Re: nodetool repair and compact

2018-04-02 Thread Alain RODRIGUEZ
I have just been told that my first statement is inaccurate:

If 'upgradesstables' is run as a routine operation, you might forget about
> it and suffer consequences. 'upgradesstables' is not only doing the
> compaction.


I should probably have checked upgradesstables closely before making this
statement and I definitely will.

Yet, I believe the second point still holds though: 'With UDC, you can
trigger the compaction of the sstables you want to remove the tombstones
from, instead of compacting *all* the sstables for a given table.'

C*heers,

2018-04-02 16:39 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi,
>
> it will re-write this table's sstable files to current version, while
>> re-writing, will evict droppable tombstones (expired +  gc_grace_seconds
>> (default 10 days) ), if partition cross different files, they will still
>> be kept, but most droppable tombstones gone and size reduced.
>>
>
> Nice tip James, I never thought about doing this, it could have been handy
> :).
>
> Now, these compactions can be automatically done using the proper
> tombstone compaction settings in most cases. Generally, tombstone
> compaction is enabled, but if tombstone eviction is still an issue, you
> might want to give a try enabling 'unchecked_tombstone_compaction' in the
> table options. This might claim quite a lot of disk space (depending on the
> sstable overlapping levels).
>
> In case manual action is really needed (even more if it is run
> automatically), I would recommend using 'User Defined Compactions' - UDC
> (accessible through JMX at least) instead of 'upgradesstables':
>
> - It will remove the tombstones the same way, but with no side effect if
> you are currently upgrading for example. If  'upgradesstable' is run as a
> routine operation, you might forget about it and suffer consequences.
> 'upgradesstable' is not only doing the compaction.
> - With UDC, you can trigger the compaction of the sstables you want to
> remove the tombstones from, instead of compacting *all* the sstables for a
> given table.
>
> This last point can prevent harming the cluster with useless compaction,
> and even allow the operator to do things like: 'Compact the 10% biggest
> sstables, that have an estimated tombstone ratio above 0.5, every day' or
> 'compact any sstable having more than 75% of tombstones' as you see fit,
> and using information such as the sstables sizes and sstablemetadata to get
> the tombstone ratio.
>
> C*heers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France / Spain
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2018-04-02 14:55 GMT+01:00 James Shaw <jxys...@gmail.com>:
>
>> you may use:  nodetool upgradesstables -a keyspace_name table_name
>> it will re-write this table's sstable files to current version, while
>> re-writing, will evict droppable tombstones (expired +  gc_grace_seconds
>> (default 10 days) ), if partition cross different files, they will still
>> be kept, but most droppable tombstones gone and size reduced.
>> It works well for ours.
>>
>>
>>
>> On Mon, Apr 2, 2018 at 12:45 AM, Jon Haddad <j...@jonhaddad.com> wrote:
>>
>>> You’ll find the answers to your questions (and quite a bit more) in this
>>> blog post from my coworker: http://thelastpickle
>>> .com/blog/2016/07/27/about-deletes-and-tombstones.html
>>>
>>> Repair doesn’t clean up tombstones, they’re only removed through
>>> compaction.  I advise taking care with nodetool compact, most of the time
>>> it’s not a great idea for a variety of reasons.  Check out the above post,
>>> if you still have questions, ask away.
>>>
>>>
>>> On Apr 1, 2018, at 9:41 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:
>>>
>>> Hi All,
>>>   I want to delete the expired tombstone, someone uses nodetool repair
>>> ,but someone uses compact,so I want to know which one is the correct way,
>>>   I have read the below pages from Datastax,but the page just tells us
>>> how to use the command, but doesn't tell us what it exactly does,
>>>   https://docs.datastax.com/en/cassandra/3.0/cassandra/tools
>>> /toolsRepair.html
>>>could anybody tell me how to clean the tombstone and give me some
>>> materials include the detailed instruction about the nodetool command and
>>> options?Web link is also ok.
>>>   Thanks very much
>>> Best Regards,
>>>
>>> 倪项菲*/ **David Ni*
>>> 中移德电网络科技有限公司
>>>
>>> Virtue Intelligent Network Ltd, co.
>>> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
>>> Mob: +86 13797007811 <+86%20137%209700%207811>|Tel: + 86 27 5024 2516
>>> <+86%2027%205024%202516>
>>>
>>>
>>>
>>
>


Re: nodetool repair and compact

2018-04-02 Thread Alain RODRIGUEZ
Hi,

it will re-write this table's sstable files to current version, while
> re-writing, will evict droppable tombstones (expired +  gc_grace_seconds
> (default 10 days) ), if partition cross different files, they will still
> be kept, but most droppable tombstones gone and size reduced.
>

Nice tip James, I never thought about doing this, it could have been handy
:).

Now, these compactions can be automatically done using the proper tombstone
compaction settings in most cases. Generally, tombstone compaction is
enabled, but if tombstone eviction is still an issue, you might want to
give a try enabling 'unchecked_tombstone_compaction' in the table options.
This might claim quite a lot of disk space (depending on the sstable
overlapping levels).

In case manual action is really needed (even more if it is run
automatically), I would recommend using 'User Defined Compactions' - UDC
(accessible through JMX at least) instead of 'upgradesstables':

- It will remove the tombstones the same way, but with no side effect if
you are currently upgrading for example. If 'upgradesstables' is run as a
routine operation, you might forget about it and suffer consequences.
'upgradesstables' is not only doing the compaction.
- With UDC, you can trigger the compaction of the sstables you want to
remove the tombstones from, instead of compacting *all* the sstables for a
given table.

This last point can prevent harming the cluster with useless compaction,
and even allow the operator to do things like: 'Compact the 10% biggest
sstables, that have an estimated tombstone ratio above 0.5, every day' or
'compact any sstable having more than 75% of tombstones' as you see fit,
and using information such as the sstables sizes and sstablemetadata to get
the tombstone ratio.
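
As an illustration, on recent versions (3.4+ if I remember correctly) this
is exposed directly through nodetool; on older versions the same thing is
reachable through the forceUserDefinedCompaction JMX operation (the sstable
path below is a placeholder):

nodetool compact --user-defined /var/lib/cassandra/data/my_keyspace/my_table-<id>/mc-42-big-Data.db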

C*heers,
---
Alain Rodriguez - @arodream - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

2018-04-02 14:55 GMT+01:00 James Shaw <jxys...@gmail.com>:

> you may use:  nodetool upgradesstables -a keyspace_name table_name
> it will re-write this table's sstable files to current version, while
> re-writing, will evict droppable tombstones (expired +  gc_grace_seconds
> (default 10 days) ), if partition cross different files, they will still
> be kept, but most droppable tombstones gone and size reduced.
> It works well for ours.
>
>
>
> On Mon, Apr 2, 2018 at 12:45 AM, Jon Haddad <j...@jonhaddad.com> wrote:
>
>> You’ll find the answers to your questions (and quite a bit more) in this
>> blog post from my coworker: http://thelastpickle
>> .com/blog/2016/07/27/about-deletes-and-tombstones.html
>>
>> Repair doesn’t clean up tombstones, they’re only removed through
>> compaction.  I advise taking care with nodetool compact, most of the time
>> it’s not a great idea for a variety of reasons.  Check out the above post,
>> if you still have questions, ask away.
>>
>>
>> On Apr 1, 2018, at 9:41 PM, Xiangfei Ni <xiangfei...@cm-dt.com> wrote:
>>
>> Hi All,
>>   I want to delete the expired tombstone, someone uses nodetool repair
>> ,but someone uses compact,so I want to know which one is the correct way,
>>   I have read the below pages from Datastax,but the page just tells us
>> how to use the command, but doesn't tell us what it exactly does,
>>   https://docs.datastax.com/en/cassandra/3.0/cassandra/tools
>> /toolsRepair.html
>>could anybody tell me how to clean the tombstone and give me some
>> materials include the detailed instruction about the nodetool command and
>> options?Web link is also ok.
>>   Thanks very much
>> Best Regards,
>>
>> 倪项菲*/ **David Ni*
>> 中移德电网络科技有限公司
>>
>> Virtue Intelligent Network Ltd, co.
>> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
>> Mob: +86 13797007811 <+86%20137%209700%207811>|Tel: + 86 27 5024 2516
>> <+86%2027%205024%202516>
>>
>>
>>
>

