Re: TWCS Log Warning

2024-05-23 Thread Jon Haddad
As an aside, if you're not putting a TTL on your data, it's a good idea to
be proactive and use multiple tables.  For example, one per month or year.
This allows you the flexibility to delete your data by dropping old tables.
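
A rough sketch of the pattern (table name and columns are made up for
illustration): one table per time window, and old windows get dropped outright.

    CREATE TABLE sensor_data_2024_05 (
        sensor_id text,
        ts timestamp,
        value double,
        PRIMARY KEY (sensor_id, ts)
    );

    -- once the window is no longer needed:
    DROP TABLE sensor_data_2024_05;

Dropping a table reclaims the space in one cheap operation instead of waiting on
tombstones and compaction; note that with auto_snapshot enabled the drop still
leaves a snapshot behind that you'll eventually want to clear.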

Storing old data in Cassandra is expensive.  Once you get to a certain
point it becomes far more cost effective to offload your old data to an
object store and keep your Cassandra cluster to a minimum size.

I gave a talk on this topic on my YT channel:
https://www.youtube.com/live/Ysfi3V2KQtU

Jon


On Thu, May 23, 2024 at 7:35 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> As the log level name "DEBUG" suggested, these are debug messages, not
> warnings.
>
> Is there any reason that made you believe these messages are warnings?
>
>
> On 23/05/2024 11:10, Isaeed Mohanna wrote:
>
> Hi
>
> I have a big table (~220GB of live space used, as reported by tablestats) with
> time series data that uses TWCS with the following settings:
> compaction = {'class':
> 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
> 'compaction_window_size': '7', 'compaction_window_unit': 'DAYS',
> 'max_threshold': '32', 'min_threshold': '4'}
>
> The table does not have a TTL configured since we need the data, and it now
> has ~450 SSTables. I have had this setup for several years and so far I am
> satisfied with the performance; we mostly read/write data from the previous
> several months. Requests for earlier data do occur, but not in the same
> quantities, and performance is less critical then.
>
> I have recently noticed recurring warnings in the Cassandra log file, and
> I wanted to ask about their meaning and whether I need to do something about
> them.
>
> DEBUG [CompactionExecutor:356242] 2024-05-23 09:01:59,655
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
> DEBUG [CompactionExecutor:356242] 2024-05-23 09:01:59,655
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
> DEBUG [CompactionExecutor:356243] 2024-05-23 09:02:59,655
> TimeWindowCompactionStrategy.java:122 - TWCS expired check sufficiently far
> in the past, checking for fully expired SSTables
> DEBUG [CompactionExecutor:356243] 2024-05-23 09:02:59,658
> TimeWindowCompactionStrategy.java:122 - TWCS expired check sufficiently far
> in the past, checking for fully expired SSTables
> DEBUG [CompactionExecutor:356242] 2024-05-23 09:03:59,655
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
> DEBUG [CompactionExecutor:356242] 2024-05-23 09:03:59,656
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
> DEBUG [CompactionExecutor:356245] 2024-05-23 09:05:00,490
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
> DEBUG [CompactionExecutor:356245] 2024-05-23 09:05:00,490
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
> DEBUG [CompactionExecutor:356244] 2024-05-23 09:06:00,490
> TimeWindowCompactionStrategy.java:129 - TWCS skipping check for fully
> expired SSTables
>
>
>
> The debug messages above appear on one of my Cassandra nodes every few
> minutes. I have a 4-node cluster with RF=3.
>
> Is there anything I need to do about those messages, or is it safe to ignore
> them?
>
> Thank you for the help
>
>


Re: Replication factor, LOCAL_QUORUM write consistency and materialized views

2024-05-17 Thread Jon Haddad
I strongly suggest you don't use materialized views at all.  There are edge
cases that in my opinion make them unsuitable for production, both in terms
of cluster stability as well as data integrity.

Jon

On Fri, May 17, 2024 at 8:58 AM Gábor Auth  wrote:

> Hi,
>
> I know, I know, the materialized view is experimental... :)
>
> So, I ran into a strange error. Among others, I have a very small 4-node
> cluster with very minimal data (~100 MB in total), the keyspace's
> replication factor is 3, and everything works fine... except: if I restart a
> node, I get a lot of errors with materialized views and consistency level
> ONE, but only for those tables that have more than one materialized view.
>
> Tables without a materialized view work fine.
> Tables with only one materialized view also work fine.
> But with a table that has more than one materialized view, whoops, the cluster
> crashes temporarily, and I can also see on the calling side (Java backend) that
> no nodes are responding:
>
> Caused by: com.datastax.driver.core.exceptions.WriteFailureException:
> Cassandra failure during write query at consistency LOCAL_QUORUM (2
> responses were required but only 1 replica responded, 2 failed)
>
> I am surprised by this behavior, because there is so little data involved,
> and it only occurs when there is more than one materialized view, so it
> might be a concurrency issue under the hood.
>
> Have you seen an issue like this?
>
> Here is a stack trace on the Cassandra's side:
>
> [cassandra-dc03-1] ERROR [MutationStage-1] 2024-05-17 08:51:47,425
> Keyspace.java:652 - Unknown exception caught while attempting to update
> MaterializedView! pope.unit
> [cassandra-dc03-1] org.apache.cassandra.exceptions.UnavailableException:
> Cannot achieve consistency level ONE
> [cassandra-dc03-1]  at
> org.apache.cassandra.exceptions.UnavailableException.create(UnavailableException.java:37)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicas(ReplicaPlans.java:170)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.assureSufficientLiveReplicasForWrite(ReplicaPlans.java:113)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:354)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:345)
> [cassandra-dc03-1]  at
> org.apache.cassandra.locator.ReplicaPlans.forWrite(ReplicaPlans.java:339)
> [cassandra-dc03-1]  at
> org.apache.cassandra.service.StorageProxy.wrapViewBatchResponseHandler(StorageProxy.java:1312)
> [cassandra-dc03-1]  at
> org.apache.cassandra.service.StorageProxy.mutateMV(StorageProxy.java:1004)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.view.TableViews.pushViewReplicaUpdates(TableViews.java:167)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.Keyspace.applyInternal(Keyspace.java:647)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.Keyspace.applyFuture(Keyspace.java:477)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.Mutation.applyFuture(Mutation.java:210)
> [cassandra-dc03-1]  at
> org.apache.cassandra.db.MutationVerbHandler.doVerb(MutationVerbHandler.java:58)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
> [cassandra-dc03-1]  at
> org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:432)
> [cassandra-dc03-1]  at
> java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown
> Source)
> [cassandra-dc03-1]  at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
> [cassandra-dc03-1]  at
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
> [cassandra-dc03-1]  at
> org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
> [cassandra-dc03-1]  at
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> [cassandra-dc03-1]  at java.base/java.lang.Thread.run(Unknown Source)
>
> --
> Bye,
> Gábor AUTH
>


Re: Change num_tokens in a live cluster

2024-05-16 Thread Jon Haddad
Unless your cluster is very small, using the method of adding / removing
nodes will eventually result in putting a much larger portion of your
dataset on a very small number of nodes.  I *highly* discourage this.

The only correct, safe path is Bowen's suggestion of adding another DC and
decommissioning the old one.

Jon

On Thu, May 16, 2024 at 1:37 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> You can also add a new DC with the desired number of nodes and num_tokens
> on each node with auto bootstrap disabled, then rebuild the new DC from the
> existing DC before decommissioning the existing DC. This method only needs to
> copy the data once and can copy from/to multiple nodes concurrently, so it is
> significantly faster, at the cost of temporarily doubling the number of nodes.
> On 16/05/2024 09:21, Gábor Auth wrote:
>
> Hi.
>
> Is there a newer/easier workflow to change num_tokens in an existing
> cluster than adding a new node with the other num_tokens value,
> decommissioning an old one, and rinsing and repeating through all nodes?
>
> --
> Bye,
> Gábor AUTH
>
>


Re: storage engine series

2024-04-30 Thread Jon Haddad
Thanks Aaron!

Just realized I made a mistake, the 4th week's URL is
https://www.youtube.com/watch?v=MAxQ0QygcKk.

Jon

On Tue, Apr 30, 2024 at 4:58 AM Aaron Ploetz  wrote:

> Nice! This sounds awesome, Jon.
>
> On Mon, Apr 29, 2024 at 6:25 PM Jon Haddad  wrote:
>
>> Hey everyone,
>>
>> I'm doing a 4-week YouTube series on the C* storage engine.  My first
>> video was last week, where I gave an overview of some of the storage
>> engine internals [1].
>>
>> The next 3 weeks are looking at the new Trie indexes coming in 5.0 [2],
>> running Cassandra on EBS [3], and finally looking at some potential
>> optimizations [4] that could be done to improve things even further in the
>> future.
>>
>> I hope these videos are useful to the community, and I welcome feedback!
>>
>> Jon
>>
>> [1] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
>> [2] https://www.youtube.com/live/ZdzwtH0cJDE?si=CumcPny2UG8zwtsw
>> [3] https://www.youtube.com/live/kcq1TC407U4?si=pZ8AkXkMzIylQgB6
>> [4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
>>
>


storage engine series

2024-04-29 Thread Jon Haddad
Hey everyone,

I'm doing a 4-week YouTube series on the C* storage engine.  My first video
was last week, where I gave an overview of some of the storage engine
internals [1].

The next 3 weeks are looking at the new Trie indexes coming in 5.0 [2],
running Cassandra on EBS [3], and finally looking at some potential
optimizations [4] that could be done to improve things even further in the
future.

I hope these videos are useful to the community, and I welcome feedback!

Jon

[1] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T
[2] https://www.youtube.com/live/ZdzwtH0cJDE?si=CumcPny2UG8zwtsw
[3] https://www.youtube.com/live/kcq1TC407U4?si=pZ8AkXkMzIylQgB6
[4] https://www.youtube.com/live/yj0NQw9DgcE?si=ra1zqusMdSs6vl4T


Trie Memtables

2024-04-09 Thread Jon Haddad
Hey all,

Tomorrow at 10:30am PDT I'm taking a look at Trie Memtables on my
live stream.  I'll do some performance comparisons between it and the
legacy SkipListMemtable implementation and see what I can learn.

https://www.youtube.com/live/Jp5R_-uXORQ?si=NnIoV3jqjHFoD8nF

or if you prefer a LinkedIn version:
https://www.linkedin.com/events/7183580733750304768/comments/

Jon


Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Jon Haddad
You shouldn’t decom an entire DC before removing it from replication.

—

Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user <
user@cassandra.apache.org> wrote:

> Hello community,
>
> In our deployments, we usually rebuild the Cassandra datacenters for
> maintenance or recovery operations.
>
> The procedure used since the days of Cassandra 3.x was the one documented
> in datastax documentation. Decommissioning a datacenter | Apache
> Cassandra 3.x (datastax.com)
> <https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsDecomissionDC.html>
>
> After upgrading to Cassandra 4.1.4, we have realized that there are some
> stricter rules that do not allow removing the replication while active
> Cassandra nodes still exist in a datacenter.
>
> This check makes the above-mentioned procedure obsolete.
>
> I am thinking to use the following as an alternative:
>
>1. Make sure no clients are still writing to any nodes in the
>datacenter.
>2. Run a full repair with nodetool repair.
>3. Run nodetool decommission using the --force option on every node in
>the datacenter being removed.
>4. Change all keyspaces so they no longer reference the datacenter
>being removed.
>
>
>
> What is the procedure followed by other users? Do you see any risk
> following the proposed procedure?
>
>
>
> BR
>
> MK
>


Re: Query on Performance Dip

2024-04-05 Thread Jon Haddad
Try changing the chunk length parameter in the compression settings to 4KB,
and reduce read ahead to 16KB if you're using EBS, or 4KB if you're using a
decent local SSD or NVMe drive.
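
The chunk length is a per-table setting; a sketch of the change (keyspace and
table names are placeholders):

    ALTER TABLE my_ks.my_table
    WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};

Existing SSTables keep their old chunk size until they get rewritten, e.g. by
normal compaction or nodetool upgradesstables -a.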

Counters read before write.

—
Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Fri, Apr 5, 2024 at 9:27 AM Subroto Barua  wrote:

> A follow-up question on the performance issue with counter writes: is there a
> parameter or condition that limits the allocation rate for
> 'CounterMutationStage'? I see 13-18 MB/s for 4.1.4 vs 20-25 MB/s for 4.0.5.
>
> The back-end infra is same for both the clusters and same test cases/data
> model.
> On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad <
> j...@jonhaddad.com> wrote:
>
>
> Hi,
>
> Unfortunately, the numbers you're posting have no meaning without
> context.  The speculative retries could be the cause of a problem, or you
> could simply be executing enough queries and you have a fairly high
> variance in latency which triggers them often.  It's unclear how many
> queries / second you're executing and there's no historical information to
> suggest if what you're seeing now is an anomaly or business as usual.
>
> If you want to determine if your theory that speculative retries are
> causing your performance issue, then you could try changing speculative
> retry to a fixed value instead of a percentile, such as 50MS.  It's easy
> enough to try and you can get an answer to your question almost immediately.
>
> The problem with this is that you're essentially guessing based on very
> limited information - the output of a nodetool command you've run "every
> few secs".  I prefer to use a more data driven approach.  Get a CPU flame
> graph and figure out where your time is spent:
> https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
>
> The flame graph will reveal where your time is spent, and you can focus on
> improving that, rather than looking at a random statistic that you've
> picked.
>
> I just gave a talk at SCALE on distributed systems performance
> troubleshooting.  You'll be better off following a methodical process than
> guessing at potential root causes, because the odds of you correctly
> guessing the root cause in a system this complex is close to zero.  My talk
> is here: https://www.youtube.com/watch?v=VX9tHk3VTLE
>
> I'm guessing you don't have dashboards in place if you're relying on
> nodetool output with grep.  If your cluster is under 6 nodes, you can take
> advantage of AxonOps's free tier: https://axonops.com/
>
> Good dashboards are essential for these types of problems.
>
> Jon
>
>
>
> On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:
>
> Hi All,
>
> While debugging the performance dip seen with 4.1.4, I found a high
> speculative retries value in nodetool tablestats during read
> operations.
>
> I ran the below tablestats command and checked its output every few
> seconds, and noticed that the retries keep rising. There is also an open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> I suspect this could be the cause of the performance dip.  Please chime in if
> anyone knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 

Re: Query on Performance Dip

2024-03-30 Thread Jon Haddad
Hi,

Unfortunately, the numbers you're posting have no meaning without context.
The speculative retries could be the cause of a problem, or you could
simply be executing enough queries and you have a fairly high variance in
latency which triggers them often.  It's unclear how many queries / second
you're executing and there's no historical information to suggest if what
you're seeing now is an anomaly or business as usual.

If you want to determine if your theory that speculative retries are
causing your performance issue, then you could try changing speculative
retry to a fixed value instead of a percentile, such as 50MS.  It's easy
enough to try and you can get an answer to your question almost immediately.
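
A sketch of what that change looks like (table name is a placeholder):

    ALTER TABLE my_ks.my_table WITH speculative_retry = '50ms';

    -- and to go back to a percentile afterwards:
    ALTER TABLE my_ks.my_table WITH speculative_retry = '99PERCENTILE';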

The problem with this is that you're essentially guessing based on very
limited information - the output of a nodetool command you've run "every
few secs".  I prefer to use a more data driven approach.  Get a CPU flame
graph and figure out where your time is spent:
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/

The flame graph will reveal where your time is spent, and you can focus on
improving that, rather than looking at a random statistic that you've
picked.

I just gave a talk at SCALE on distributed systems performance
troubleshooting.  You'll be better off following a methodical process than
guessing at potential root causes, because the odds of you correctly
guessing the root cause in a system this complex is close to zero.  My talk
is here: https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on
nodetool output with grep.  If your cluster is under 6 nodes, you can take
advantage of AxonOps's free tier: https://axonops.com/

Good dashboards are essential for these types of problems.

Jon



On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

> Hi All,
>
> While debugging the performance dip seen with 4.1.4, I found a high
> speculative retries value in nodetool tablestats during read
> operations.
>
> I ran the below tablestats command and checked its output every few
> seconds, and noticed that the retries keep rising. There is also an open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> I suspect this could be the cause of the performance dip.  Please chime in if
> anyone knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
>> we are seeing similar perf issues with counter writes - to reproduce:
>>
>> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
>> threads=50 -mode native cql3 user= password= -name 
>>
>>
>> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
>> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
>> Total GC count: 750 (4.1) and 744 (4.0)
>> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>>
>>
>> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
>> goel.ra...@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> I was going through this mail chain
>> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>> and was wondering whether this could cause a performance degradation in
>> 4.1 without changing compactionThroughput.
>>
>> We are seeing a performance dip in read/write after upgrading from 4.0 to 4.1.
>>
>> Regards
>> Ranju
>>
>


Tomorrow 10AM PDT - Examining LWT perf in 5.0

2024-03-19 Thread Jon Haddad
Hey folks,

I'm doing a working session tomorrow at 10am PDT, testing LWTs in C* 5.0.
I'll be running benchmarks and doing some performance analysis.  Come hang
out and bring your questions!

Jon

YouTube: https://www.youtube.com/watch?v=IoWh647LRQ0

LinkedIn:
https://www.linkedin.com/events/cassandra5workingsession-lightw7174223694586687490/comments/


Streaming a working session with 5.0 - UCS

2024-03-05 Thread Jon Haddad
Hey everyone,

Today starting at 10am PT I'm going to be streaming my session messing with
5.0, looking at UCS.  I'm doing this with my easy-cass-lab and
easy-cass-stress tools using a build of C* from last night.  I'll also show
some of the cool things you can do with my tools.

I'll be running these tests with ZGC and the new BTI table format, so if
you're interested in either of those, please bring your questions and
drop them in the comments. If there's something you want to know, this is a
good time to ask b/c I'm willing to absolutely wreck this environment, for
science.

See you there,
Jon

YouTube: https://www.youtube.com/watch?v=UwRoXlbJrDA

LinkedIn (please share for reach):
https://www.linkedin.com/events/workingsession-handsonwithapach7168769189904674816/comments/


Re: Check out new features in K8ssandra and Mission Control

2024-02-27 Thread Jon Haddad
Hey Chris - this looks pretty interesting!  It looks like there's a lot of
functionality in here.

* What aspects of Mission Control are dependent on using K8ssandra?
* Can Mission Control work without K8ssandra?
* Is mission control open source?
* I'm not familiar with Vector - does it require an agent?
* Is Reaper deployed separately or integrated in?

Thanks!  Looking forward to trying this out.
Jon


On Tue, Feb 27, 2024 at 7:07 AM Christopher Bradford 
wrote:

> Hey C* folks,
>
> I'm excited to share that the DataStax team has just released Mission
> Control, a new operations platform for running Apache Cassandra and DataStax
> Enterprise. Built around the open source core of K8ssandra, we've been hard
> at work expanding multi-region capabilities. If you haven't seen some of
> the new features coming in, here are some highlights:
>
>
>    - Management API support in Reaper - no more JMX credentials, YAY
>    - Additional support for TLS across the stack - including operator to
>      node, Reaper to management API, etc.
>    - Updated metrics pipeline - removal of collectd from nodes, Vector for
>      monitoring log files (goodbye tail -f)
>    - Deterministic node selection for cluster operations
>    - Top-level management tasks in the control plane (no more forced
>      connections to data planes to trigger a restart)
>
>
> On top of this Mission Control offers:
>
>
>    - A single web-interface to monitor and manage your clusters wherever
>      they're deployed
>    - Automatic management of internode and operator to node certificates -
>      this includes integration with third party CAs and rotation of all
>      certificates, keys, and various Java stores
>    - Centralized metrics and logs aggregation, querying and storage, with
>      the capability to split the pipeline allowing for exporting of streams to
>      other observability tools within your environment
>    - Per-node configuration (this is an edge case, but still something we
>      wanted to make possible)
>
>
> While building our Mission Control, K8ssandra has seen a number of
> releases with quite a few contributions from the community. From Helm chart
> updates to operator tweaks we want to send out a huge THANK YOU to everyone
> who has filed issues, opened pull requests, and helped us test bugfixes and
> new functionality.
>
> If you've been sleeping on K8ssandra, now is a good time to check it out.
> It has all of the pieces needed to run Cassandra in production. If you're
> looking for something out of the box instead of putting the pieces together
> yourself, take Mission Control for a spin and sign up for the trial. I'm
> happy to answer any K8ssandra or Mission Control questions you may have here
> or on our Discord.
>
> Cheers,
>
> ~Chris
>
> Christopher Bradford
>
>


stress testing & lab provisioning tools

2024-02-26 Thread Jon Haddad
Hey everyone,

Over the last several months I've put a lot of work into 2 projects I
started back at The Last Pickle, for stress testing Cassandra and for
building labs in AWS.  You may know them as tlp-stress and tlp-cluster.

Since I haven't worked at TLP in almost half a decade, and am the primary /
sole person investing time, I've rebranded them to easy-cass-stress and
easy-cass-lab.  There have been several major improvements in both projects
and I invite you to take a look at both of them.

easy-cass-stress

Many of you are familiar with tlp-stress.  easy-cass-stress is a fork /
rebrand of the project that uses almost the same familiar interface as
tlp-stress, but with some improvements.  easy-cass-stress is even easier to
use, requiring less guessing at the parameters to help you figure out your
performance profile.  Instead of providing a -c flag (for in-flight
concurrency) you can now simply provide your max read and write latencies
and it'll figure out the throughput it can get on its own, or use fixed
rate scheduling like many other benchmarking tools have.  The adaptive
scheduling is based on a Netflix Tech Blog post, but slightly modified to
be sensitive to latency metrics instead of just errors.   You can read more
about some of my changes here:
https://rustyrazorblade.com/post/2023/2023-10-31-tlp-stress-adaptive-scheduler/

GH repo: https://github.com/rustyrazorblade/easy-cass-stress

easy-cass-lab

This is a powerful tool that makes it much easier to spin up lab
environments using any released version of Cassandra, with functionality
coming to test custom branches and trunk.  It's a departure from the old
tlp-cluster that installed and configured everything at runtime.  By
creating a universal, multi-version AMI complete with all my favorite
debugging tools, it's now possible to create a lab environment in under 2
minutes in AWS.  The image includes easy-cass-stress making it
straightforward to spin up clusters to test existing releases, and soon
custom builds and trunk.  Fellow committer Jordan West has been working on
this with me and we've made a ton of progress over the last several weeks.
 For a demo check out my working session live stream last week where I
fixed a few issues and discussed the potential and development path for the
tool: https://youtu.be/dPtsBut7_MM

GH repo: https://github.com/rustyrazorblade/easy-cass-lab

I hope you find these tools as useful as I have.  I am aware of many
extremely large Cassandra teams using tlp-stress with their 1K+ node
environments, and hope the additional functionality in easy-cass-stress
makes it easier for folks to start benchmarking C*, possibly in conjunction
with easy-cass-lab.

Looking forward to hearing your feedback,
Jon


Re: Remove folders of deleted tables

2023-12-05 Thread Jon Haddad
I can't think of a reason to keep empty directories around, so it seems like a
reasonable change, but I don't think you're butting up against something that
most people would run into, as snapshots are enabled by default (auto_snapshot:
true) and almost nobody changes it.

The use case you described isn't handled well by Cassandra for a host of other 
reasons, and I would *never* do that in a production environment with any 
released version.  The folder thing is the least of the issues you'll run into, 
so even if you contribute a patch and address it, I'd still wouldn't do it 
until transactional cluster metadata gets released and I've had a chance to 
kick the tires to see what issues you run into besides schema inconsistencies.  
I suspect the drivers won't love it either.

Assuming you're running into an issue now:

find . -type d -empty -exec rmdir {} \;

rmdir only removes empty directories, and you'll need to run it twice (once for
the backups subfolder, once for the then-empty table folder).  It will remove all
empty directories under that folder, so if you've got unused tables you'd be better
off using the find command to get the list, removing the active tables from it, and
explicitly running the rmdir command with the directories you want cleaned up.
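
A sketch of that more careful approach (the data path below is the default
install location and may differ on your system; run the two commands twice,
since the table folder only becomes empty after its backups subfolder is gone):

    find /var/lib/cassandra/data/<keyspace> -type d -empty > empty_dirs.txt
    # review empty_dirs.txt and delete any lines that belong to tables still in use
    xargs -a empty_dirs.txt -r rmdir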

Jon

On 2023/12/04 19:55:06 Sébastien Rebecchi wrote:
> Thank you Dipan.
> 
> Do you know if there is a good reason for Cassandra to let tables folder
> even when there is no snapshot?
> 
> I'm thinking of use cases where there is a need to create and delete
> small tables at a high rate. You could quickly end up with more than 65K
> subdirectories (the ext4 limit) in the keyspace directory, while 99.9...% of
> them are residue of deleted tables.
> 
> It seems quite dirty for Cassandra not to clean up its own "garbage" by
> itself, and quite dangerous for the end user to have to do it alone, don't
> you think?
> 
> Thanks,
> 
> Sébastien.
> 
> Le lun. 4 déc. 2023, 11:28, Dipan Shah  a écrit :
> 
> > Hello Sebastien,
> >
> > There are no inbuilt tools that will automatically remove folders of
> > deleted tables.
> >
> > Thanks,
> >
> > Dipan Shah
> > --
> > *From:* Sébastien Rebecchi 
> > *Sent:* 04 December 2023 13:54
> > *To:* user@cassandra.apache.org 
> > *Subject:* Remove folders of deleted tables
> >
> > Hello,
> >
> > When we delete a table with Cassandra, it lets the folder of that table on
> > file system, even if there is no snapshot (auto snapshots disabled).
> > So we end with the empty folder {data folder}/{keyspace name}/{table
> > name-table id} containing only 1  subfolder, backups, which is itself empty.
> > Is there a way to automatically remove folders of deleted tables?
> >
> > Sébastien.
> >
> 


Re: Memory and caches

2023-11-27 Thread Jon Haddad
I haven't found chunk cache to be particularly useful.  It's a fairly small 
cache that could only help when you're dealing with a small hot dataset.  I 
wouldn't bother increasing memory for it.

Key cache can be helpful, but it depends on the workload.  I generally 
recommend optimizing for your HW first for the case where you don't hit cache.  

Generally, cache is used to make up for issues with bottlenecked I/O.  If you 
haven't already done so, I recommend taking a look at what you're actually 
doing in terms of device I/O (bitehist), comparing that to what's being requested
of your filesystem (an eBPF probe + histogram on vfs_read), and looking at your page
cache hit rate with cachestat.  You're likely to find you've got a ton of read 
amplification due to either misconfigured compression or read ahead, both of 
which can saturate your disks and make it appear like you need to give more 
memory to cache.  I always recommend optimizing things for the worst case (all
cache misses), then using cache to improve things, vs papering over an underlying
perf issue.
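
For reference, a couple of the checks above look like this (the cachestat path
assumes the bcc-tools package and varies by distro):

    # read ahead per device; the RA column is in 512-byte sectors
    sudo blockdev --report
    # page cache hits and misses, sampled every second
    sudo /usr/share/bcc/tools/cachestat 1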

I wrote a bunch about this recently:

https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
https://rustyrazorblade.com/post/2023/2023-11-14-bcc-tools/
https://rustyrazorblade.com/post/2023/2023-11-21-bpftrace/

Jon

On 2023/11/27 14:59:55 Sébastien Rebecchi wrote:
> Hello
> 
> When I use nodetool info, it prints that relevant information
> 
> Heap Memory (MB)   : 14229.31 / 32688.00
> Off Heap Memory (MB)   : 5390.57
> Key Cache  : entries 670423, size 100 MiB, capacity 100 MiB,
> 13152259 hits, 47205855 requests, 0.279 recent hit rate, 14400 save period
> in seconds
> Chunk Cache: entries 63488, size 992 MiB, capacity 992 MiB,
> 143250511 misses, 162302465 requests, 0.117 recent hit rate, 2497.557
> microseconds miss latency
> 
> Here I focus on lines relevant for that conversation. And the numbers are
> roughly the same for all nodes of the cluster.
> The key and chunk caches are full and the hit rate is low. At the same time
> the heap memory is far from being used at full capacity.
> I would say that I can significantly increase the sizes of those caches in
> order to increase hit rate and improve performance.
> In cassandra.yaml, key_cache_size_in_mb has a blank value, so 100 MiB by
> default, and file_cache_size_in_mb is set to 1024.
> I'm thinking about setting key_cache_size_in_mb to 1024
> and file_cache_size_in_mb to 2048. What would you recommend? Is anyone
> having good experience with tuning those parameters?
> 
> Thank you in advance.
> 
> Sébastien.
> 


Re: Running Large Clusters in Production

2020-07-10 Thread Jon Haddad
I worked on a handful of large clusters (> 200 nodes) using vnodes, and
there were some serious issues with both performance and availability.  We
had to put in a LOT of work to fix the problems.

I agree with Jeff - it's way better to manage multiple clusters than a
really large one.


On Fri, Jul 10, 2020 at 2:49 PM Jeff Jirsa  wrote:

> 1000 instances are fine if you're not using vnodes.
>
> I'm not sure what the limit is if you're using vnodes.
>
> If you might get to 1000, shard early before you get there. Running 8x100
> host clusters will be easier than one 800 host cluster.
>
>
> On Fri, Jul 10, 2020 at 2:19 PM Isaac Reath (BLOOMBERG/ 919 3RD A) <
> ire...@bloomberg.net> wrote:
>
>> Hi All,
>>
>> I’m currently dealing with a use case that is running on around 200
>> nodes, due to growth of their product as well as onboarding additional data
>> sources, we are looking at having to expand that to around 700 nodes, and
>> potentially beyond to 1000+. To that end I have a couple of questions:
>>
>> 1) For those who have experienced managing clusters at that scale, what
>> types of operational challenges have you run into that you might not see
>> when operating 100 node clusters? A couple that come to mind are version
>> (especially major version) upgrades become a lot more risky as it no longer
>> becomes feasible to do a blue / green style deployment of the database and
>> backup & restore operations seem far more error prone as well for the same
>> reason (having to do an in-place restore instead of being able to spin up a
>> new cluster to restore to).
>>
>> 2) Is there a cluster size beyond which sharding across multiple clusters
>> becomes the recommended approach?
>>
>> Thanks,
>> Isaac
>>
>>


Re: Upgrading cassandra cluster from 2.1 to 3.X when using custom TWCS

2020-07-09 Thread Jon Haddad
You could also pull TWCS out of the version of Cassandra you want to
deploy, fix the imports and change the package name.  Then you've got the
same version as OSS, just under the name you're using in 2.1.  Once you've
moved to 3.11, you can switch to the OSS version.
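
Once every node is on 3.11, switching a table back to the built-in strategy is a
one-line ALTER; a sketch with placeholder table and window settings (keep
whatever window you use today):

    ALTER TABLE my_ks.my_table WITH compaction = {
        'class': 'org.apache.cassandra.db.compaction.TimeWindowCompactionStrategy',
        'compaction_window_unit': 'DAYS',
        'compaction_window_size': '7'
    };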

On Thu, Jul 9, 2020 at 9:09 AM Gil Ganz  wrote:

> Great, thank you very much!
>
> On Thu, Jul 9, 2020 at 7:02 PM Oleksandr Shulgin <
> oleksandr.shul...@zalando.de> wrote:
>
>> On Thu, Jul 9, 2020 at 5:54 PM Gil Ganz  wrote:
>>
>>> That sounds very interesting Alex, so just to be sure I understand, it
>>> was like this
>>> 1 - you had 2.1 cluster running with the 2.1 version jar
>>> 2 - you upgraded to 3.0, starting the cluster with the 3.0 version jar
>>> that has the same strategy name
>>> 3 - you changed the compaction strategy to use the built in one?
>>>
>>
>> Correct.
>>
>> Another question, did changing the compaction strategy from one twcs to
>>> the other trigger merging of old sstables?
>>>
>>
>> I don't recall any unexpected action from changing the strategy, but of
>> course you should verify on a test system first if you have one.
>>
>> Cheers,
>> --
>> Alex
>>
>>


Re: Cassandra upgrade from 3.11.3 -> 3.11.6

2020-06-24 Thread Jon Haddad
Generally speaking, don't run mixed versions longer than you have to, and
don't upgrade that way.

Why?

* We don't support it.
* We don't even test it.
* If you run into trouble and ask for help, the first thing people will
tell you is to get all nodes on the same version.

Anyone that's doing so that didn't specifically read the source and test it
out for themselves only got lucky in that they didn't hit any issues.  If
you do it, and hit issues, be prepared to get very familiar with the C*
source as you're on your own.

Be smart and go the supported, well traveled route.  You'll need to do it
when upgrading majors *anyways*, so you might as well figure out the right
way of doing it *today* and follow the same stable method every time you
upgrade.



On Wed, Jun 24, 2020 at 8:36 AM Jai Bheemsen Rao Dhanwada <
jaibheem...@gmail.com> wrote:

> Thank you all for the suggestions.
>
> I am not trying to scale up the cluster for capacity. For the upgrade
> process, instead of an in-place upgrade, I am planning to add nodes with 3.11.6
> and then decommission the nodes with 3.11.3.
>
> On Wednesday, June 24, 2020, Durity, Sean R 
> wrote:
>
>> Streaming operations (repair/bootstrap) with different file versions is
>> usually a problem. Running a mixed version cluster is fine – for the time
>> you are doing the upgrade. I would not stay on mixed versions for any
>> longer than that. It takes more time, but I separate out the admin tasks so
>> that I can reason what should happen. I would either scale up or upgrade
>> (depending on which is more urgent), then do the other.
>>
>>
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* manish khandelwal 
>> *Sent:* Wednesday, June 24, 2020 5:52 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: Cassandra upgrade from 3.11.3 -> 3.11.6
>>
>>
>>
>> Rightly said by Surbhi, it is not good to scale with mixed versions as
>> debugging issues will be very difficult.
>>
>> Better to upgrade first and then scale.
>>
>>
>>
>> Regards
>>
>>
>>
>> On Wed, Jun 24, 2020 at 11:20 AM Surbhi Gupta 
>> wrote:
>>
>> In case of any issue, it gets very difficult to debug when we have
>> multiple versions.
>>
>>
>>
>> On Tue, 23 Jun 2020 at 22:23, Jürgen Albersdorfer <
>> jalbersdor...@gmail.com> wrote:
>>
>> Hi, I would say „It depends“ - as it always does. I have had a 21 Node
>> Cluster running in Production in one DC with versions ranging from 3.11.1
>> to 3.11.6 without having had any single issue for over a year. I just
>> upgraded all nodes to 3.11.6 for the sake of consistency.
>>
>> Von meinem iPhone gesendet
>>
>>
>>
>> Am 24.06.2020 um 02:56 schrieb Surbhi Gupta :
>>
>> 
>>
>>
>>
>> Hi ,
>>
>>
>>
>> We have recently upgraded from 3.11.0 to 3.11.5. There is an SSTable
>> format change starting with 3.11.4.
>>
>> We also had to expand the cluster, and we discussed expanding
>> first and then upgrading. In the end we upgraded and then expanded.
>>
>> From our experience, what I can tell you is that it is not advisable to
>> add new nodes on a higher version.
>>
>> There are many bugs which got fixed from 3.11.3 to 3.11.6.
>>
>>
>>
>> Thanks
>>
>> Surbhi
>>
>>
>>
>> On Tue, Jun 23, 2020 at 5:04 PM Jai Bheemsen Rao Dhanwada <
>> jaibheem...@gmail.com> wrote:
>>
>> Hello,
>>
>>
>>
>> I am trying to upgrade from 3.11.3 to 3.11.6.
>>
>> Can I add new nodes with the 3.11.6  version to the cluster running with
>> 3.11.3?
>>
>> Also, I see the SSTable format changed from mc-* to md-*, does this cause
>> any issues?
>>
>>
>>
>>
>>
>


Re: Generating evenly distributed tokens for vnodes

2020-05-29 Thread Jon Haddad
I'm on mobile now so I might be mistaken, but I don't think nodetool move
works with multiple tokens

On Fri, May 29, 2020, 1:48 PM Kornel Pal  wrote:

> Hi Anthony,
>
> Thank you very much for looking into using the script for initial token
> generation and for providing multiple detailed methods of expanding the
> cluster.
>
> This helps a lot, indeed.
>
> Regards,
> Kornel
> Anthony Grasso wrote:
>
> Hi Kornel,
>
> Great use of the script for generating initial tokens! I agree that you
> can achieve an optimal token distribution in a cluster using such a method.
>
> One thing to think about is the process for expanding the size of the
> cluster in this case. For example consider the scenario where you wanted to
> insert a single new node into the cluster. To do this you would need to
> calculate what the new token ranges should be for the nodes including the
> new node. You would then need to reassign existing tokens to other nodes
> using 'nodetool move'. You would likely need to call this command a few
> times to do a few movements in order to achieve the newly calculated token
> assignments. Once the "gap" in the token ranges has been created, you would
> then update the initial_token property for the existing nodes in the
> cluster. Finally, you could then insert the new node with the assigned
> tokens.
>
> While the above process could be used to maintain an optimal token
> distribution in a cluster, it does increase operational overhead. This is
> where allocate_tokens_for_keyspace and
> allocate_tokens_for_local_replication_factor (4.0 only) play a critical
> role. They save the operational overhead when changing the size of the
> cluster. In addition, from my experience they do a pretty good job at
> keeping the token ranges evenly distributed when expanding the cluster.
> Even in the case where a low number for num_tokens is used. If expanding
> the cluster size is required during an emergency, using the
> allocate_token_* setting would be the most simple and reliable way to
> quickly insert a node while maintaining reasonable token distribution.
>
> The only other way to expand the cluster and maintain even token
> distribution without using an allocate_token_* setting, is to double the
> size of the cluster each time. Obviously this has its own draw backs in
> terms of increase costs to both money and time compared to inserting a
> single node.
>
> Hope this helps.
>
> Kind regards,
> Anthony
>
> On Thu, 28 May 2020 at 04:52, Kornel Pal  wrote:
>
>> As I understand, the previous discussion is about using
>> allocate_tokens_for_keyspace for allocating tokens for most of the
>> nodes. On the other hand, I am proposing to generate all the tokens for
>> all the nodes using a Python script.
>>
>> This seems to result in perfectly even token ownership distribution
>> across all the nodes for all possible replication factors, thus being an
>> improvement over using allocate_tokens_for_keyspace.
>>
>> Elliott Sims wrote:
>> > There's also a slightly older mailing list discussion on this subject
>> > that goes into detail on this sort of strategy:
>> > https://www.mail-archive.com/user@cassandra.apache.org/msg60006.html
>> >
>> > I've been approximately following it, repeating steps 3-6 for the first
>> > host in each "rack" (replica, since I have 3 racks and RF=3), then 8-10 for
>> > the remaining hosts in the new datacenter.  So far, so good (sample size
>> > of 1) but it's a pretty painstaking process
>> >
>> > This should get a lot simpler with Cassandra 4+'s
>> > "allocate_tokens_for_local_replication_factor" option, which will
>> > default to 3.
>> >
>> > On Wed, May 27, 2020 at 4:34 AM Kornel Pal > > > wrote:
>> >
>> > Hi,
>> >
>> > Generating ideal tokens for single-token datacenters is well
>> understood
>> > and documented, but there is much less information available on
>> > generating tokens with even ownership distribution when using
>> vnodes.
>> > The best description I could find on token generation for vnodes is
>> >
>> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>> >
>> > While allocate_tokens_for_keyspace results in much more even
>> ownership
>> > distribution than random allocation, and does a great job at
>> balancing
>> > ownership when adding new nodes, using it for creating a new
>> datacenter
>> > results in less than ideal ownership distribution.
>> >
>> > After some experimentation, I found that it is possible to generate
>> all
>> > the tokens for a new datacenter with an extended version of the
>> Python
>> > script presented in the above blog post. Using these tokens seem to
>> > result in perfectly even ownership distribution with various
>> > token/node/rack configurations for all possible replication factors.
>> >
>> > Murmur3Partitioner:
>> >   >>> datacenter_offset = 0
>> >   >>> num_tokens = 4
>> >   >>> num_racks 
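
(The quoted script is cut off by the archive. As an illustration only, here is a
simplified sketch of the same idea, ignoring the racks and datacenter offset
handled by the full script: spread num_nodes * num_tokens tokens evenly over the
Murmur3 ring and deal them out round-robin across the nodes.)

    num_nodes = 3
    num_tokens = 4
    total = num_nodes * num_tokens
    ring = [i * (2**64 // total) - 2**63 for i in range(total)]
    for node in range(num_nodes):
        tokens = ring[node::num_nodes]
        print("node %d initial_token: %s" % (node, ",".join(map(str, tokens))))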

Re: Performance of Data Types used for Primary keys

2020-03-06 Thread Jon Haddad
It's not going to matter at all.

On Fri, Mar 6, 2020, 2:15 AM Hanauer, Arnulf, Vodacom South Africa
(External)  wrote:

> Hi Cassandra folks,
>
>
>
> Is there any difference in performance of general operations if using a
> TEXT based Primary key versus a BIGINT Primary key.
>
>
>
> Our use-case requires low latency reads but currently the Primary key is
> TEXT based but the data could work on BIGINT. We are trying to optimise
> where possible.
>
> Any experiences that could point to a winner?
>
>
>
>
>
> Kind regards
> Arnulf Hanauer
>
>
>
>
>
>
>
>
>
>
> "This e-mail is sent on the Terms and Conditions that can be accessed by
> Clicking on this link https://webmail.vodacom.co.za/tc/default.html
>  "
>


Re: Deleting data from future

2020-03-02 Thread Jon Haddad
You can issue a delete using a future timestamp.

http://cassandra.apache.org/doc/latest/cql/dml.html#grammar-token-update-parameter

Look for USING TIMESTAMP.
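
A sketch with a made-up table and key: the delete's timestamp just needs to be
higher than the future write timestamp (1584349956844022 in your case).

    DELETE FROM my_ks.my_table
    USING TIMESTAMP 1584349956844023
    WHERE id = 'the-affected-key';

Keep the WHERE clause as narrow as possible: a tombstone with that timestamp
shadows every write to those rows with a lower timestamp, including legitimate
ones.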

Jon

On Mon, Mar 2, 2020, 3:28 AM Furkan Cifci  wrote:

> Greetings,
> In our C* cluster, one node lost time sync and went to the future (16 Mar
> 2020) for a while.
> After fixing the time sync, we couldn't update or delete records that were
> inserted while the node's system time was in the future.
> Inspecting the sstables, we found that the timestamp (TS field value) of these
> records was 1584349956844022, i.e. Monday, 16 March 2020 09:12:36.844.
> We can neither delete these records nor truncate the table.
> Is there any way to manipulate records inside an sstable manually?
>
>
>
>
>


Re: Should we use Materialised Views or ditch them ?

2020-02-28 Thread Jon Haddad
I also recommend avoiding them.  I've seen too many clusters fall over as a
result of their usage.

On Fri, Feb 28, 2020 at 9:52 AM Max C.  wrote:

> The general view of the community is that you should *NOT* use them in
> production, due to multiple serious outstanding issues (see Jira).  We used
> them quite a bit when they first came out and have since rolled back all
> uses except for the absolute most basic cases (ex:  a table with 30K rows
> that isn’t updated).  If we were to do it over, we would not use them at
> all.
>
> - Max
>
> On Feb 28, 2020, at 7:07 am, Tobias Eriksson 
> wrote:
>
> Hi
>  A debate has surfaced in my company, whether to keep or remove
> Materialized Views
> The Datastax FAQ says sure thing, go ahead and use it
> https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/faqMV.html
> But know the limitations
>
> https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/knownLimitationsMV.html
> and best practices
> https://docs.datastax.com/en/dse/5.1/cql/cql/cql_using/bestPracticesMV.html
>
> What is the community take on using MV(Materialized Views) in production ?
>
> -Tobias
>
>
>


Re: New seed node in the cluster immediately UN without passing for UJ state

2020-02-13 Thread Jon Haddad
Seeds don't bootstrap, don't list new nodes as seeds.

On Thu, Feb 13, 2020 at 5:23 PM Sergio  wrote:

> Hi guys!
>
> I don't know how, but this is the first time I have seen such behavior. I
> wanted to add a new node to the cluster and it looks to be working fine, but
> instead of waiting 2-3 hours to stream roughly 100GB of data, it immediately
> went to the UN (Up and Normal) state.
>
> I saw a bunch of exceptions in the logs, including this WARN:
>  [MessagingService-Incoming-/10.1.17.126] 2020-02-14 01:08:07,812
> IncomingTcpConnection.java:103 - UnknownColumnFamilyException reading from
> socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table
> for cfId a5af88d0-24f6-11e9-b009-95ed77b72f6e. If a table was just created,
> this is likely due to the schema not being fully propagated.  Please wait
> for schema agreement on table creation.
> at
> org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1525)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:850)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:825)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:415)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:434)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:371)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
> at
> org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
> ~[apache-cassandra-3.11.5.jar:3.11.5]
>
> but in the end, it is working...
>
> Suggestion?
>
> Thanks,
>
> Sergio
>


Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-12 Thread Jon Haddad
A while ago, on my first cluster, I decided to do an upgrade by adding
nodes running 1.2 to an existing cluster running version 1.1.  This was a
bad decision, and at that point I decided to always play it safe and always
stick to a single version, and never bootstrap a node running a different
version into a cluster.

I would do this even when it's a trivial upgrade, or a single commit is
different only touching a few lines of code.

Why?

I do this because in my opinion it's better to have a single way of doing
things.  Getting in the routine of doing minor upgrades by following Sean's
upgrade checklist will always work.  It's the supported, correct way of
doing things, and if you decide to always follow the same checklist, the
possibility of doing things wrong when there is an
incompatibility decreases.  Think of it as practice.

The only time you should even consider mixed version bootstrap upgrades is
if you know enough to not have to ask the list.  Even then, I go back to my
previous point about practice.  No point in practicing the less safe
version of things.

My 2 cents.

Jon



On Wed, Feb 12, 2020 at 11:02 AM Sergio  wrote:

> Thanks for your reply!
>
> So as long as the sstable format has not changed, I can avoid doing that.
>
> Correct?
>
> Best,
>
> Sergio
>
> On Wed, Feb 12, 2020, 10:58 AM Durity, Sean R 
> wrote:
>
>> Check the readme.txt for any upgrade notes, but the basic procedure is to:
>>
>>- Verify that nodetool upgradesstables has completed successfully on
>>all nodes from any previous upgrade
>>- Turn off repairs and any other streaming operations (add/remove
>>nodes)
>>- Stop an un-upgraded node (seeds first, preferably)
>>- Install new binaries and configs on the down node
>>- Restart that node and make sure it comes up clean (it will function
>>normally in the cluster – even with mixed versions)
>>- Repeat for all nodes
>>- Run upgradesstables on each node (as many at a time as your load
>>will allow). Minor upgrades usually don’t require this step (only if the
>>sstable format has changed), but it is good to check.
>>- NOTE: in most cases applications can keep running and will not
>>notice much impact – unless the cluster is overloaded and a single node
>>down causes impact.
>>
>>
>>
>>
>>
>>
>>
>> Sean Durity – Staff Systems Engineer, Cassandra
>>
>>
>>
>> *From:* Sergio 
>> *Sent:* Wednesday, February 12, 2020 11:36 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Cassandra 3.11.X upgrades
>>
>>
>>
>> Hi guys!
>>
>> How do you usually upgrade your cluster for minor version upgrades?
>>
>> I tried to add a node with 3.11.5 version to a test cluster with 3.11.4
>> nodes.
>>
>> Is there any restriction?
>>
>> Best,
>>
>> Sergio
>>
>>
>


Re: [RELEASE] Apache Cassandra 4.0-alpha3 released

2020-02-07 Thread Jon Haddad
Thanks for handling this, Mick!

On Fri, Feb 7, 2020 at 12:02 PM Mick Semb Wever  wrote:

>
>
> The Cassandra team is pleased to announce the release of Apache Cassandra
> version 4.0-alpha3.
>
> Apache Cassandra is a fully distributed database. It is the right choice
> when you need scalability and high availability without compromising
> performance.
>
>  http://cassandra.apache.org/
>
> Downloads of source and binary distributions are listed in our download
> section:
>  http://cassandra.apache.org/download/
>
>
> Downloads of source and binary distributions:
>
> http://www.apache.org/dyn/closer.lua/cassandra/4.0-alpha3/apache-cassandra-4.0-alpha3-bin.tar.gz
>
> http://www.apache.org/dyn/closer.lua/cassandra/4.0-alpha3/apache-cassandra-4.0-alpha3-src.tar.gz
>
> Debian and Redhat configurations.
>
>   sources.list:
>   deb http://www.apache.org/dist/cassandra/debian 40x main
>
>   yum config:
>   baseurl=https://www.apache.org/dist/cassandra/redhat/40x/
>
> See http://cassandra.apache.org/download/ for full install instructions.
>
> This is an ALPHA version! It is not intended for production use, however
> the project would appreciate your testing and feedback to make the final
> release better. As always, please pay attention to the release notes[2]
> and let us know[3] if you encounter any problems.
>
> Enjoy!
>
> [1]: CHANGES.txt
> ?p=cassandra.git;a=blob_plain;f=CHANGES.txt;hb=refs/tags/cassandra-4.0-alpha3
> [2]: NEWS.txt
> ?p=cassandra.git;a=blob_plain;f=NEWS.txt;hb=refs/tags/cassandra-4.0-alpha3
> [3]: https://issues.apache.org/jira/browse/CASSANDRA
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Overload because of hint pressure + MVs

2020-02-07 Thread Jon Haddad
There's a few things you can do here that might help.

First off, if you're using the default heap settings, that's a serious
problem.  If you've got the head room, my recommendation is to use 16GB
heap with 12 GB new gen and pin your memtable heap space to 2GB.  Set your
max tenuring threshold to 6 and your survivor ratio to 6.  You don't need a
lot of old gen space with cassandra, almost everything that will show up
there is memtable related, and we allocate a *lot* whenever we read data
off disk.
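
For reference, here's roughly what those settings look like as flags (in
jvm.options, or cassandra-env.sh on older layouts) plus the matching
cassandra.yaml line.  Treat it as a sketch to adapt, not a drop-in file:

    -Xms16G
    -Xmx16G
    -Xmn12G
    -XX:+UseParNewGC
    -XX:+UseConcMarkSweepGC
    -XX:MaxTenuringThreshold=6
    -XX:SurvivorRatio=6

    # cassandra.yaml
    memtable_heap_space_in_mb: 2048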

Most folks use the default disk read ahead setting of 128KB.  You can check
this setting using blockdev --report, under the RA column.  You'll see 256
there, that's in 512 byte sectors.  MVs rely on a read before a write, so
for every read off disk you do, you'll pull additional 128KB into your page
cache.  This is usually a waste and puts WAY too much pressure on your
disk.  On SSD, I always change this to 4KB.
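
For example (the device name here is just an illustration, check blockdev
--report for yours; the setting doesn't survive a reboot, so put it in a udev
rule or your bootstrap script):

    blockdev --report                    # RA column, in 512 byte sectors
    blockdev --setra 8 /dev/nvme0n1      # 8 * 512 bytes = 4KB read ahead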

Next, be sure you're setting your compression chunk size accordingly.  I wrote a
long post on the topic here:
https://thelastpickle.com/blog/2018/08/08/compression_performance.html.
Our default compression is very unfriendly for read heavy workloads if
you're reading small rows.  If your records are small, 4KB compression
chunk length is your friend.
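
As a sketch, for a hypothetical table ks.events that's:

    ALTER TABLE ks.events
      WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};

Existing SSTables only pick up the new chunk length as compaction rewrites
them; nodetool upgradesstables -a rewrites them immediately.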

I have some slides showing pretty good performance improvements from the
above 2 changes.  Specifically, I went from 16K reads a second at 180ms p99
latency up to 63K reads / second at 21ms p99.  Disk usage dropped by a
factor of 10.  Throw in those JVM changes I recommended and things should
improve even further.

Generally speaking, I recommend avoiding MVs, as they can be a giant mine
if you aren't careful.  They're not doing any magic behind the scenes that
makes scaling easier, and in a lot of cases they're a hindrance.  You
still need to understand the underlying data and how it's laid out to use
them properly, which is 99% of the work.

Jon

On Fri, Feb 7, 2020 at 10:32 AM Michael Shuler 
wrote:

> That JIRA still says Open, so no, it has not been fixed (unless there's
> a fixed duplicate in JIRA somewhere).
>
> For clarification, you could update that ticket with a comment including
> your environmental details, usage of MV, etc. I'll bump the priority up
> and include some possible branchX fixvers.
>
> Michael
>
> On 2/7/20 10:53 AM, Surbhi Gupta wrote:
> > Hi,
> >
> > We are getting hit by the below bug.
> > Other than lowering hinted_handoff_throttle_in_kb to 100 any other work
> > around ?
> >
> > https://issues.apache.org/jira/browse/CASSANDRA-13810
> >
> > Any idea if it got fixed in later version.
> > We are on Open source Cassandra 3.11.1  .
> >
> > Thanks
> > Surbhi
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Question on large partition key

2019-12-31 Thread Jon Haddad
I suggest checking out Aaron Morton's post on the 3.0 storage engine.

https://thelastpickle.com/blog/2016/03/04/introductiont-to-the-apache-cassandra-3-storage-engine.html

On Tue, Dec 31, 2019 at 11:20 AM Subroto Barua 
wrote:

> I have a table ---
>
> create Table mytable (
>
> Id text,
>
> cdate timestamp,
>
> Tk text,
>
> Primary key (id, cdate)
>
> ) with clustering order by (cdate desc);
>
> One of the partition key has 2,099,414 rows; using the following formula:
>
> row_size = sum_of_all_columns_ size_within_row + partition_key_size
> row_size = 32bytes (string) + 8 + 32 == 72 bytes
>
> partition_size = row_ size_average * number_of_rows_in_this_partition
> partition_size = 72 * 2099414 = 147,615 KB
>
> Cassandra system log reports: 128,064,307 bytes for this key
>
> Can someone explain the gap? Did I make any wrong assumption in
> calculating the row size/pk size?
>
> C* version is 3.0.15
>
> Thanks,
>
> Subroto
>
>


Re: Streaming Failed during bootstrap of a Replacement node

2019-12-20 Thread Jon Haddad
Without getting too in the weeds here - ideally yes, you'd be able to
replace the node.  However, the error you're getting looks (at a glance) to
be one of the many range tombstone bugs that were fixed in later versions.

On Fri, Dec 20, 2019 at 11:23 AM Nethi, Manoj 
wrote:

> Hi Jon,
>
> Yes we will upgrade it soon. But before we can upgrade shouldn’t we get
> this lost node in the cluster to be replaced ?
>
>
>
>
>
>
>
> *From:* Jon Haddad 
> *Sent:* Friday, December 20, 2019 2:13 PM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Streaming Failed during bootstrap of a Replacement node
>
>
>
> *This email is from an external source - **exercise caution regarding
> links and attachments. *
>
>
>
> You should upgrade to Cassandra 3.11.5 before doing anything else.  You're
> running a pretty old and buggy version.  There's been hundreds (maybe
> thousands) of bugs fixed between 3.3 and 3.11.5.
>
>
>
> On Fri, Dec 20, 2019 at 10:46 AM Nethi, Manoj 
> wrote:
>
> Hi,
>
>
>
> We are seeing the following error while bootstrapping a node which is
> replacement of a failed node in a multi DC cluster.
>
>
>
>
>
>
>
> WARN  [STREAM-IN-/**.***.***.**] 2019-12-19 23:36:40,120
> StreamSession.java:641 - [Stream #f231db30-22da-11ea-b38a-0f6cfab62953]
> Retrying for following error
>
> java.lang.ArrayIndexOutOfBoundsException: -127
>
> at
> org.apache.cassandra.db.RangeTombstone$Bound$Serializer.deserialize(RangeTombstone.java:201)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:355)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:86)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:64)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.StreamReader$StreamDeserializer.hasNext(StreamReader.java:253)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:108)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:120)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.ColumnIndex.writeAndBuildIndex(ColumnIndex.java:57)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.format.RangeAwareSSTableWriter.append(RangeAwareSSTableWriter.java:99)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.StreamReader.writePartition(StreamReader.java:178)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:106)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:50)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:39)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:59)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:261)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
>
> ERROR [STREAM-IN-/**.***.***.**] 2019-12-19 23:36:40,120
> StreamSession.java:520 - [Stream #f231db30-22da-11ea-b38a-0f6cfab62953]
> Streaming error occurred
>
> java.lang.IllegalArgumentException: Unknown type 0
>
> at
> org.apache.cassandra.streaming.messages.StreamMessage$Type.get(StreamMessage.java:97)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:58)
> ~[apache-cassandra-3.3.0.jar:3

Re: Streaming Failed during bootstrap of a Replacement node

2019-12-20 Thread Jon Haddad
You should upgrade to Cassandra 3.11.5 before doing anything else.  You're
running a pretty old and buggy version.  There's been hundreds (maybe
thousands) of bugs fixed between 3.3 and 3.11.5.

On Fri, Dec 20, 2019 at 10:46 AM Nethi, Manoj 
wrote:

> Hi,
>
>
>
> We are seeing the following error while bootstrapping a node which is
> replacement of a failed node in a multi DC cluster.
>
>
>
>
>
>
>
> WARN  [STREAM-IN-/**.***.***.**] 2019-12-19 23:36:40,120
> StreamSession.java:641 - [Stream #f231db30-22da-11ea-b38a-0f6cfab62953]
> Retrying for following error
>
> java.lang.ArrayIndexOutOfBoundsException: -127
>
> at
> org.apache.cassandra.db.RangeTombstone$Bound$Serializer.deserialize(RangeTombstone.java:201)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.rows.UnfilteredSerializer.deserialize(UnfilteredSerializer.java:355)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:86)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.SSTableSimpleIterator$CurrentFormatIterator.computeNext(SSTableSimpleIterator.java:64)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.StreamReader$StreamDeserializer.hasNext(StreamReader.java:253)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:108)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.ColumnIndex$Builder.build(ColumnIndex.java:120)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.db.ColumnIndex.writeAndBuildIndex(ColumnIndex.java:57)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.io.sstable.format.RangeAwareSSTableWriter.append(RangeAwareSSTableWriter.java:99)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.StreamReader.writePartition(StreamReader.java:178)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.compress.CompressedStreamReader.read(CompressedStreamReader.java:106)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:50)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.IncomingFileMessage$1.deserialize(IncomingFileMessage.java:39)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:59)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:261)
> [apache-cassandra-3.3.0.jar:3.3.0]
>
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
>
> ERROR [STREAM-IN-/**.***.***.**] 2019-12-19 23:36:40,120
> StreamSession.java:520 - [Stream #f231db30-22da-11ea-b38a-0f6cfab62953]
> Streaming error occurred
>
> java.lang.IllegalArgumentException: Unknown type 0
>
> at
> org.apache.cassandra.streaming.messages.StreamMessage$Type.get(StreamMessage.java:97)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.messages.StreamMessage.deserialize(StreamMessage.java:58)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at
> org.apache.cassandra.streaming.ConnectionHandler$IncomingMessageHandler.run(ConnectionHandler.java:261)
> ~[apache-cassandra-3.3.0.jar:3.3.0]
>
> at java.lang.Thread.run(Thread.java:745) [na:1.8.0_74]
>
> INFO  [STREAM-IN-/**.***.***.**] 2019-12-19 23:36:40,124
> StreamResultFuture.java:185 - [Stream
> #f231db30-22da-11ea-b38a-0f6cfab62953] Session with /**.***.**.** is
> complete
>
> WARN  [STREAM-IN-/**.***.***.**] 2019-12-19 23:36:40,124
> StreamResultFuture.java:212 - [Stream
> #f231db30-22da-11ea-b38a-0f6cfab62953] Stream failed
>
> WARN  [Thread-738] 2019-12-19 23:36:40,126 CompressedInputStream.java:182
> - Error while reading compressed input stream.
>
> java.nio.channels.ClosedChannelException: null
>
> at
> sun.nio.ch.SocketChannelImpl.ensureReadOpen(SocketChannelImpl.java:257)
> ~[na:1.8.0_74]
>
> at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:300)
> ~[na:1.8.0_74]
>
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:59)
> ~[na:1.8.0_74]
>
> at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:109)
> ~[na:1.8.0_74]
>
>   

Re: execute is faster than execute_async?

2019-12-11 Thread Jon Haddad
I'm not sure how you're measuring this - could you share your benchmarking
code?

I ask because execute calls execute_async under the hood:
https://github.com/datastax/python-driver/blob/master/cassandra/cluster.py#L2316

I tested the python driver a ways back and found some weird behavior due to
the way it's non blocking code was implemented.  IIRC there were some sleep
calls thrown in there to get around Python's threading inadequacies.  I
can't remember if this code path is avoided when you use the execute() call.
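
For what it's worth, here's roughly the shape I'd expect the comparison to
take (a sketch, not your code -- contact point, keyspace and table are made
up).  execute_async only pays off if you issue many requests before waiting
on their futures:

    from cassandra.cluster import Cluster
    import time

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('my_ks')
    insert = session.prepare("INSERT INTO kv (id, value) VALUES (?, ?)")

    # Blocking: one round trip at a time.
    start = time.time()
    for i in range(40960):
        session.execute(insert, (str(i), 'x'))
    print("execute:", time.time() - start)

    # Async: fire everything, then block on the futures.  In practice you'd
    # cap the number of in-flight requests instead of holding 40k futures.
    start = time.time()
    futures = [session.execute_async(insert, (str(i), 'x')) for i in range(40960)]
    for f in futures:
        f.result()
    print("execute_async:", time.time() - start)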

Jon


On Wed, Dec 11, 2019 at 3:09 AM lampahome  wrote:

> I insert 1 row 40960 times using session.execute() and
> session.execute_async().
>
> I found the total time with execute() is always faster than with
> execute_async().
>
> Does that make sense? Or am I missing some detail of how they work?
>


Re: Connection Pooling in v4.x Java Driver

2019-12-10 Thread Jon Haddad
I'm not sure how closely the driver maintainers are following this list.
You might want to ask on the Java Driver mailing list:
https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user




On Tue, Dec 10, 2019 at 5:10 PM Caravaggio, Kevin <
kevin.caravag...@lowes.com> wrote:

> Hello,
>
>
>
>
>
> When integrating with DataStax OSS Cassandra Java driver v4.x, I noticed 
> “Unlike
> previous versions of the driver, pools do not resize dynamically”
> 
> in reference to the connection pool configuration. Is anyone aware of the
> reasoning for this departure from dynamic pool sizing, which I believe was
> available in v3.x?
>
>
>
>
>
> Thanks,
>
>
>
>
>
> Kevin
>
>
>


Re: AWS ephemeral instances + backup

2019-12-05 Thread Jon Haddad
You can easily do this with bcache or LVM
http://rustyrazorblade.com/post/2018/2018-04-24-intro-to-lvm/.
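
As a very rough sketch of the LVM-cache flavor of that (device names and
sizes are assumptions -- the EBS volume is the durable origin, the NVMe
ephemeral is the cache; the default writethrough mode keeps the EBS copy
consistent if the ephemeral disappears):

    pvcreate /dev/xvdf /dev/nvme0n1
    vgcreate cassandra_vg /dev/xvdf /dev/nvme0n1
    lvcreate -n data -L 900G cassandra_vg /dev/xvdf
    lvcreate --type cache-pool -L 400G -n nvme_cache cassandra_vg /dev/nvme0n1
    lvconvert --type cache --cachepool cassandra_vg/nvme_cache cassandra_vg/data
    mkfs.xfs /dev/cassandra_vg/data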

Medusa might be a good route to go down if you want to do backups instead:
https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html



On Thu, Dec 5, 2019 at 12:21 PM Carl Mueller
 wrote:

> Does anyone have experience tooling written to support this strategy:
>
> Use case: run cassandra on i3 instances on ephemerals but synchronize the
> sstables and commitlog files to the cheapest EBS volume type (those have
> bad IOPS but decent enough throughput)
>
> On node replace, the startup script for the node, back-copies the sstables
> and commitlog state from the EBS to the ephemeral.
>
> As can be seen:
> https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html
>
> the (presumably) spinning rust tops out at 2375 MB/sec (using multiple EBS
> volumes presumably) that would incur about a ten minute delay for node
> replacement for a 1TB node, but I imagine this would only be used on higher
> IOPS r/w nodes with smaller densities, so 100GB would be about a minute of
> delay only, already within the timeframes of an AWS node
> replacement/instance restart.
>
>
>


Re: Cassandra 4 alpha/alpha2

2019-11-01 Thread Jon Haddad
A new thing like this would be much better served by the community through
several iterations.  For instance, over the last year I've developed a tool
for spinning up lab clusters, it's here:
https://thelastpickle.com/tlp-cluster/

I had to make a *lot* of tradeoffs here.  Everything Jeff mentioned, plus a
handful of others.  I'm pretty sure if someone where to try to accomplish
the same thing they'd go about it a different way.  It might be better or
worse in a variety of ways.

I agree with Jeff that these are problems better solved elsewhere, at least
till an idea matures enough to adopt it.  Otherwise we're creating a lot of
technical debt that needs to be maintained, when we really need about a
dozen prototypes that get thrown out first.

Sharing work with the community, whether it be through packer, AMIs, or
home made tooling is the first step here.

Jon

On Fri, Nov 1, 2019 at 11:47 AM Jeff Jirsa  wrote:

> Lots of this, but also getting into the weeds, it's pretty clear this is
> nontrivial:
>
> - If we did an AWS AMI, would we also do Azure? GCP? AliCloud? OCI? Where
> do we stop?
> - What if there's a security hole in the base image - who's responsible
> for fixing that? We could have tooling that makes a new one every day, but
> that tooling has to run somewhere, who's going to pay for it?
> - What base OS? Do we do Amazon Linux or CentOS or Ubuntu or Debian? How
> do we choose? PV or HVM?
> - Which region? How long do we keep it? If we're doing nightly AMIs to
> pick up security fixes in the base image, what do we do with old AMIs? If
> we yank them we may break people, if we dont they may be using something
> with a security hole.
>
> All of these are solvable problems, but we're just not at a point where
> we're going to solve them at the project level anytime soon.
>
>
>
>
> On Fri, Nov 1, 2019 at 8:01 AM Reid Pinchback 
> wrote:
>
>> That is indeed what Amazon AMIs are for.  
>>
>>
>>
>> However if your question is “why don’t the C* developers do that for
>> people?” the answer is going to be some mix of “people only do so much work
>> for free” and “the ones that don’t do it for free have a company you pay to
>> do things like that (Datastax)”.  Keep in mind, that when you create AMIs
>> you’re using AWS resources and whoever owns the account that did the work,
>> is on the hook to pay for the resources.
>>
>>
>>
>> But if your question is about whether you can do that for your own
>> company, then obviously yes.  And when you do so at first it’ll be about
>> C*, then it’ll be about how your company in particular likes to monitor
>> things, and handle backup, spec out encryption of data at rest, and deal
>> with auth security, and deal with log shipping, and deal with PII concerns,
>> and …
>>
>>
>>
>> Which is why there isn’t really a big win to other people setting up an
>> AMI for you, except in cases where they are offering
>> whatever-it-is-as-a-service and get paid for its usage.  1000 consumers
>> will say they want a simple thing, but all 1000 usages will be a little
>> different, and nobody will like the AMI they get if their simple thing
>> isn’t present on it.
>>
>>
>>
>> (plus AMI creation and maintenance, within and across regions, is just a
>> pain in the rump and I can’t imagine doing it without money coming back
>> from the effort)
>>
>>
>>
>>
>>
>> *From: *Sergio 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Thursday, October 31, 2019 at 4:09 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: Cassandra 4 alpha/alpha2
>>
>>
>>
>> *Message from External Sender*
>>
>> OOO but still relevant:
>> Would not it be possible to create an Amazon AMI that has all the OS and
>> JVM settings in the right place and from there each developer can tweak the
>> things that need to be adjusted?
>> Best,
>> Sergio
>>
>>
>>
>> Il giorno gio 31 ott 2019 alle ore 12:56 Abdul Patel 
>> ha scritto:
>>
>> Looks like i am messing up or missing something ..will revisit again
>>
>> On Thursday, October 31, 2019, Stefan Miklosovic <
>> stefan.mikloso...@instaclustr.com> wrote:
>>
>> Hi,
>>
>> I have tested both alpha and alpha2 and 3.11.5 on Centos 7.7.1908 and
>> all went fine (I have some custom images for my own purposes).
>>
>> Update between alpha and alpha2 was just about mere version bump.
>>
>> Cheers
>>
>> On Thu, 31 Oct 2019 at 20:40, Abdul Patel  wrote:
>> >
>> > Hey Everyone
>> >
>> > Did anyone was successfull to install either alpha or alpha2 version
>> for cassandra 4.0?
>> > Found 2 issues :
>> > 1> cassandra-env.sh:
>> > JAVA_VERSION varianle is not defined.
>> > Jvm-server.options file is not defined.
>> >
>> > This is fixable and after adding those , the error for cassandra-env.sh
>> errora went away.
>> >
>> > 2> second and major issue the cassandea binary when i try to start says
>> syntax error.
>> >
>> > /bin/cassandea: line 198:exec: : not found.
>> >
>> > Anyone has any idea on second issue?
>> >
>>
>> 

Re: Cassandra 4 alpha/alpha2

2019-10-31 Thread Jon Haddad
What artifact did you use and what OS are you on?

On Thu, Oct 31, 2019 at 12:40 PM Abdul Patel  wrote:

> Hey Everyone
>
> Did anyone was successfull to install either alpha or alpha2 version for
> cassandra 4.0?
> Found 2 issues :
> 1> cassandra-env.sh:
> JAVA_VERSION varianle is not defined.
> Jvm-server.options file is not defined.
>
> This is fixable and after adding those , the error for cassandra-env.sh
> errora went away.
>
> 2> second and major issue the cassandea binary when i try to start says
> syntax error.
>
> /bin/cassandea: line 198:exec: : not found.
>
> Anyone has any idea on second issue?
>
>


Re: What is the status of counters? Should I use them?

2019-10-30 Thread Jon Haddad
It's possible to overcount when a server is overwhelmed or slow to respond
and you're getting exceptions on the client.  If you retry your query, it's
possible you'll increment twice: once for the original query (which may have
been applied on the server even though it threw an exception back to the
client) and again on the retry.

Use counters if you're OK with approximating values which are right _most
of the time_ and wrong _when the cluster is a dumpster fire_.  You can also
track your individual requests and reconcile the counters later on if you
want to eventually be right.  You may find you want to remove some counts
anyway if they turn out to come from bots or abuse, so IMO this is a better
approach than blindly incrementing on the assumption that no one is doing
anything nefarious.
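
For context, the increment itself is the non-idempotent part.  A page-view
style counter table looks something like this (names made up), and retrying
that UPDATE after a timeout is exactly where the overcount comes from:

    CREATE TABLE analytics.page_views (
        page_id text PRIMARY KEY,
        views   counter
    );

    UPDATE analytics.page_views SET views = views + 1 WHERE page_id = '/home';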

I can't say for sure if there's an issue or not with repairs.  Given the
way they're written now I don't think there is, but I haven't had the need
to investigate it, and I don't see anything in JIRA to suggest it's a
problem with modern counters.  Someone else may know something I don't
though.

Jon

On Wed, Oct 30, 2019 at 9:41 AM  wrote:

> What about repairs? Can I just repair that table on a regular basis as any
> other?
>
>
>
> ‐‐‐ Original Message ‐‐‐
> On Wednesday, 30 October 2019 16:26, Jon Haddad  wrote:
>
> Counters are good for things like page views, bad for money.  Yes they can
> under or overcount in certain situations.  If your cluster is stable,
> you'll see very little of it in practice.
>
> I've done quite a bit of tuning of counters.  Here's the main takeaways:
>
> * They do a read before a write, so use low latency disks (SSD)
> * Dial back read ahead to 4KB, this is a big deal (in fact, always do this
> even if you're not using counters and you are using SSDs)
> * Use 4KB compression chunk length
> * Bump up your counter cache
> * Some basic JVM tuning (ParNew + CMS, 16GB heap 10GB new, max tenuring
> threshold 4, survivor ratio 6)
>
> The last 3 will give you a 10-20x perf improvement over stock Cassandra if
> you've got a lot of counters.
>
> Jon
>
>
>
> On Wed, Oct 30, 2019 at 7:01 AM  wrote:
>
>> Hi,
>>
>> I would like to use counters but I am not sure I should.
>>
>> I read a lot of articles on the Internet how counters are bad / wrong /
>> inaccurate etc etc ...
>>
>> Let's be honest, counters in Cassandra have quite a bad reputation.
>>
>> But all stuff I read about that was quite old, I know there was
>> significant improvements in that area especially around 2.1 / 2.2 releases
>> but I can not make my head around so I can definitely be sure if I should
>> use them or not.
>>
>> The literature I read were:
>>
>> 1) That one elaborates about counters from node-lifecycle perspective and
>> there are still some problems of over / undercounting.
>>
>> 2) This one explains the differences between pre and post 2.1
>> implementations and suggests that once counter caches are removed, the
>> implementation will be even better and simplified - but I am not sure what
>> is the outcome of this article? It says that all "wrong" implementation of
>> counters (as we knew them in pre 2.x era) was corrected and we should be
>> all good to use it?
>>
>> 3) These guys said that they have not found any bugs ... huh.
>>
>> So, what is the overall state of counters in 3.11.4 ? (hence 3.11.5)?
>> Would you recommend to use them in production?
>>
>> My usecase is that I have 2 DCs with 3 nodes each and I have a table
>> where I want to track number number of page visits.
>>
>> My perception is that "they will be inconsistent and you can not repair
>> it and it is idempotent" but from what I have tested, when I put 1 node
>> down and I brought it back and read it, it was just fine and numbers were
>> good.
>>
>> So I am not sure if my testing is very naive but the whole mystery about
>> counters and the lack of the authoritative advice what the general status
>> is and where it can go wrong is imho lacking.
>>
>> Are the links below obsolete? Do I have strong guarantee that counters
>> will "just work"? What are the downsides and why would not you use them?
>> Honestly, after reading a lot about that, I am not trusting counters too
>> much but I am not sure if my opinion is biased based on what I read so far.
>>
>> Thanks
>>
>> Links
>>
>> 1) http://datastrophic.io/evaluating-cassandra-2-1-counters-consistency/
>>
>> 2)
>> https://www.datastax.com/blog/2014/05/whats-new-cassandra-21-better-implementation-counters
>>
>> 3) https://www.datastax.com/blog/2016/01/testing-apache-cassandra-jepsen
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>
>>
>


Re: Where to get old RPMs?

2019-10-30 Thread Jon Haddad
Archives are here: http://archive.apache.org/dist/cassandra/

For example, the RPM for 3.11.x you can find here:
http://archive.apache.org/dist/cassandra/redhat/311x/

The old releases are removed by Apache automatically as part of their
policy, it's not specific to Cassandra.


On Wed, Oct 30, 2019 at 10:39 AM Reid Pinchback 
wrote:

> With the latest round of C* updates, the yum repo no longer has whatever
> the previous version is.  For environments that try to do more controlled
> stepping of release changes instead of just taking the latest, is there any
> URL for previous versions of RPMs?  Previous jars I can find easily enough,
> but not RPMs.
>
>
>


Re: What is the status of counters? Should I use them?

2019-10-30 Thread Jon Haddad
Counters are good for things like page views, bad for money.  Yes they can
under or overcount in certain situations.  If your cluster is stable,
you'll see very little of it in practice.

I've done quite a bit of tuning of counters.  Here's the main takeaways:

* They do a read before a write, so use low latency disks (SSD)
* Dial back read ahead to 4KB, this is a big deal (in fact, always do this
even if you're not using counters and you are using SSDs)
* Use 4KB compression chunk length
* Bump up your counter cache
* Some basic JVM tuning (ParNew + CMS, 16GB heap 10GB new, max tenuring
threshold 4, survivor ratio 6)

The last 3 will give you a 10-20x perf improvement over stock Cassandra if
you've got a lot of counters.
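
For reference, the non-JVM knobs above map to something like this (device,
table and cache size are illustrative):

    blockdev --setra 8 /dev/nvme0n1      # 4KB read ahead (8 x 512 byte sectors)

    ALTER TABLE ks.page_views
      WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 4};

    # cassandra.yaml
    counter_cache_size_in_mb: 1024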

Jon



On Wed, Oct 30, 2019 at 7:01 AM  wrote:

> Hi,
>
> I would like to use counters but I am not sure I should.
>
> I read a lot of articles on the Internet how counters are bad / wrong /
> inaccurate etc etc ...
>
> Let's be honest, counters in Cassandra have quite a bad reputation.
>
> But all stuff I read about that was quite old, I know there was
> significant improvements in that area especially around 2.1 / 2.2 releases
> but I can not make my head around so I can definitely be sure if I should
> use them or not.
>
> The literature I read were:
>
> 1) That one elaborates about counters from node-lifecycle perspective and
> there are still some problems of over / undercounting.
>
> 2) This one explains the differences between pre and post 2.1
> implementations and suggests that once counter caches are removed, the
> implementation will be even better and simplified - but I am not sure what
> is the outcome of this article? It says that all "wrong" implementation of
> counters (as we knew them in pre 2.x era) was corrected and we should be
> all good to use it?
>
> 3) These guys said that they have not found any bugs ... huh.
>
> So, what is the overall state of counters in 3.11.4 ? (hence 3.11.5)?
> Would you recommend to use them in production?
>
> My usecase is that I have 2 DCs with 3 nodes each and I have a table where
> I want to track number number of page visits.
>
> My perception is that "they will be inconsistent and you can not repair it
> and it is idempotent" but from what I have tested, when I put 1 node down
> and I brought it back and read it, it was just fine and numbers were good.
>
> So I am not sure if my testing is very naive but the whole mystery about
> counters and the lack of the authoritative advice what the general status
> is and where it can go wrong is imho lacking.
>
> Are the links below obsolete? Do I have strong guarantee that counters
> will "just work"? What are the downsides and why would not you use them?
> Honestly, after reading a lot about that, I am not trusting counters too
> much but I am not sure if my opinion is biased based on what I read so far.
>
> Thanks
>
> Links
>
> 1) http://datastrophic.io/evaluating-cassandra-2-1-counters-consistency/
>
> 2)
> https://www.datastax.com/blog/2014/05/whats-new-cassandra-21-better-implementation-counters
>
> 3) https://www.datastax.com/blog/2016/01/testing-apache-cassandra-jepsen
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: TWCS and gc_grace_seconds

2019-10-26 Thread Jon Haddad
My coworker Radovan wrote up a post on the relationship between gc grace
and hinted handoff:
https://thelastpickle.com/blog/2018/03/21/hinted-handoff-gc-grace-demystified.html
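
The two settings in play, for reference (the values shown are just the
defaults, included so the units are clear):

    # cassandra.yaml -- how long hints are collected for an unreachable node
    max_hint_window_in_ms: 10800000      # 3 hours

    -- per table -- how long tombstones must be kept before they can be purged
    ALTER TABLE ks.tbl WITH gc_grace_seconds = 864000;   -- 10 days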

Jon

On Sat, Oct 26, 2019 at 6:45 AM Hossein Ghiyasi Mehr 
wrote:

> It needs to change gc_grace_seconds carefully because it has side effect
> on hinted handoff.
>
> On Fri, Oct 18, 2019 at 5:04 PM Paul Chandler  wrote:
>
>> Hi Adarsh,
>>
>> You will have problems if you manually delete data when using TWCS.
>>
>> To fully understand why, I recommend reading this The Last Pickle post:
>> https://thelastpickle.com/blog/2016/12/08/TWCS-part1.html
>> And this post I wrote that dives deeper into the problems with deletes:
>> http://www.redshots.com/cassandra-twcs-must-have-ttls/
>>
>> Thanks
>>
>> Paul
>>
>> On 18 Oct 2019, at 14:22, Adarsh Kumar  wrote:
>>
>> Thanks Jeff,
>>
>>
>> I just checked with business and we have differences in having TTL. So it
>> will be manula purging always. We do not want to use LCS due to high IOs.
>> So:
>>
>>1. As the use case is of time series data model, TWCS will be give
>>some benefit (without TTL) and with frequent deleted data
>>2. Are there any best practices/recommendations to handle high number
>>of tombstones
>>3. Can we handle this use case  with STCS also (with some
>>configurations)
>>
>>
>> Thanks in advance
>>
>> Adarsh Kumar
>>
>> On Fri, Oct 18, 2019 at 11:46 AM Jeff Jirsa  wrote:
>>
>>> Is everything in the table TTL’d?
>>>
>>> Do you do explicit deletes before the data is expected to expire ?
>>>
>>> Generally speaking, gcgs exists to prevent data resurrection. But ttl’d
>>> data can’t be resurrected once it expires, so gcgs has no purpose unless
>>> you’re deleting it before the ttl expires. If you’re doing that, twcs won’t
>>> be able to drop whole sstables anyway, so maybe LCS will be less disk usage
>>> (but much higher IO)
>>>
>>> On Oct 17, 2019, at 10:36 PM, Adarsh Kumar  wrote:
>>>
>>> 
>>> Hi,
>>>
>>> We have a use case of time series data with TTL where we want to use
>>> TimeWindowCompactionStrategy because of its better management for TTL and
>>> tombstones. In this case, data we have is frequently deleted so we want to
>>> reduce gc_grace_seconds to reduce the tombstones' life and reduce pressure
>>> on storage. I have following questions:
>>>
>>>1. Do we always need to run repair for the table in reduced
>>>gc_grace_seconds or there is any other way to manage repairs in this vase
>>>2. Do we have any other strategy (or combination of strategies) to
>>>manage frequently deleted time-series data
>>>
>>> Thanks in advance.
>>>
>>> Adarsh Kumar
>>>
>>>
>>


Re: Repair Issues

2019-10-24 Thread Jon Haddad
There are some major warning signs for me with your environment.  4GB heap is
too low, and Cassandra 3.7 isn't something I would put into production.

Your surface area for problems is massive right now.  Things I'd do:

1. Never use incremental repair.  Seems like you've already stopped doing
them, but it's worth mentioning.
2. Upgrade to the latest JVM, that version's way out of date.
3. Upgrade to Cassandra 3.11.latest (we're voting on 3.11.5 right now).
4. Increase memory to 8GB minimum, preferably 12.

I usually don't like making a bunch of changes without knowing the root
cause of a problem, but in your case there's so many potential problems I
don't think it's worth digging into, especially since the problem might be
one of the 500 or so bugs that were fixed since this release.

Once you've done those things it'll be easier to narrow down the problem.
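
For reference, once you're on 3.11.x a scheduled full repair is just
something like the following, run against each node in turn (keyspace names
taken from your output; -pr avoids repairing every range once per replica):

    nodetool repair -full -pr platform_users
    nodetool repair -full -pr platform_management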

Jon


On Thu, Oct 24, 2019 at 4:59 PM Ben Mills  wrote:

> Hi Sergio,
>
> No, not at this time.
>
> It was in use with this cluster previously, and while there were no
> reaper-specific issues, it was removed to help simplify investigation of
> the underlying repair issues I've described.
>
> Thanks.
>
> On Thu, Oct 24, 2019 at 4:21 PM Sergio  wrote:
>
>> Are you using Cassandra reaper?
>>
>> On Thu, Oct 24, 2019, 12:31 PM Ben Mills  wrote:
>>
>>> Greetings,
>>>
>>> Inherited a small Cassandra cluster with some repair issues and need
>>> some advice on recommended next steps. Apologies in advance for a long
>>> email.
>>>
>>> Issue:
>>>
>>> Intermittent repair failures on two non-system keyspaces.
>>>
>>> - platform_users
>>> - platform_management
>>>
>>> Repair Type:
>>>
>>> Full, parallel repairs are run on each of the three nodes every five
>>> days.
>>>
>>> Repair command output for a typical failure:
>>>
>>> [2019-10-18 00:22:09,109] Starting repair command #46, repairing
>>> keyspace platform_users with repair options (parallelism: parallel, primary
>>> range: false, incremental: false, job threads: 1, ColumnFamilies: [],
>>> dataCenters: [], hosts: [], # of ranges: 12)
>>> [2019-10-18 00:22:09,242] Repair session
>>> 5282be70-f13d-11e9-9b4e-7f6db768ba9a for range
>>> [(-1890954128429545684,2847510199483651721],
>>> (8249813014782655320,-8746483007209345011],
>>> (4299912178579297893,6811748355903297393],
>>> (-8746483007209345011,-8628999431140554276],
>>> (-5865769407232506956,-4746990901966533744],
>>> (-4470950459111056725,-1890954128429545684],
>>> (4001531392883953257,4299912178579297893],
>>> (6811748355903297393,6878104809564599690],
>>> (6878104809564599690,8249813014782655320],
>>> (-4746990901966533744,-4470950459111056725],
>>> (-8628999431140554276,-5865769407232506956],
>>> (2847510199483651721,4001531392883953257]] failed with error [repair
>>> #5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2,
>>> [(-1890954128429545684,2847510199483651721],
>>> (8249813014782655320,-8746483007209345011],
>>> (4299912178579297893,6811748355903297393],
>>> (-8746483007209345011,-8628999431140554276],
>>> (-5865769407232506956,-4746990901966533744],
>>> (-4470950459111056725,-1890954128429545684],
>>> (4001531392883953257,4299912178579297893],
>>> (6811748355903297393,6878104809564599690],
>>> (6878104809564599690,8249813014782655320],
>>> (-4746990901966533744,-4470950459111056725],
>>> (-8628999431140554276,-5865769407232506956],
>>> (2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x
>>> (progress: 26%)
>>> [2019-10-18 00:22:09,246] Some repair failed
>>> [2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds
>>>
>>> Additional Notes:
>>>
>>> Repairs encounter above failures more often than not. Sometimes on one
>>> node only, though occasionally on two. Sometimes just one of the two
>>> keyspaces, sometimes both. Apparently the previous repair schedule for
>>> this cluster included incremental repairs (script alternated between
>>> incremental and full repairs). After reading this TLP article:
>>>
>>>
>>> https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html
>>>
>>> the repair script was replaced with cassandra-reaper (v1.4.0), which was
>>> run with its default configs. Reaper was fine but only obscured the ongoing
>>> issues (it did not resolve them) and complicated the debugging process and
>>> so was then removed. The current repair schedule is as described above
>>> under Repair Type.
>>>
>>> Attempts at Resolution:
>>>
>>> (1) nodetool scrub was attempted on the offending keyspaces/tables to no
>>> effect.
>>>
>>> (2) sstablescrub has not been attempted due to the current design of the
>>> Docker image that runs Cassandra in each Kubernetes pod - i.e. there is no
>>> way to stop the server to run this utility without killing the only pid
>>> running in the container.
>>>
>>> Related Error:
>>>
>>> Not sure if this is related, though sometimes, when either:
>>>
>>> (a) Running nodetool snapshot, or
>>> (b) Rolling a pod that runs a Cassandra node, which 

Re: merge two cluster

2019-10-23 Thread Jon Haddad
Probably not beneficial, I wouldn't do it.  Not a fan of multi-tenancy with
Cassandra unless the use cases are so small that your noisy neighbor
problem is not very noisy at all.  For those cases I don't know what you
get from Cassandra other than a cool resume.

On Wed, Oct 23, 2019 at 12:41 PM Reid Pinchback 
wrote:

> I haven’t seen much evidence that larger cluster = more performance, plus
> or minus the statistics of speculative retry.  It horizontally scales for
> storage definitely, and somewhat for connection volume.  If anything, per
> Sean’s observation, you have less ability to have a stable tuning for a
> particular usage pattern.
>
>
>
> Try to have a mental picture of what you think is happening in the JVM
> while Cassandra is running.  There are short-lived objects, medium-lived
> objects, long/static-lived objects, and behind the scenes some degree of
> read I/O and write I/O against disk.  Garbage collectors struggle badly
> with medium-lived objects, but Cassandra really depends a great deal on
> those.  If you merge two clusters together, within any one node you still
> have the JVM size and disk architecture you had before, but you are adding
> competition on fixed resources and potentially in the very way they find
> most difficult to handle.
>
>
>
> If those resources were heavily underutilized, like Sean’s point about
> merging small apps together, then sure.  But if those two clusters of yours
> are already showing that they experience significant load, then you are
> unlikely to improve anything, far more likely to end up worse off.  GC
> overhead and compaction flushes to disk are your challenges; merging two
> clusters doesn’t change the physics of those two areas, but could increase
> the demand on them.
>
>
>
> The only caveat to all of the above I can think of is if there was a
> fault-tolerance story motivating the merging.  Like “management wants us in
> two AZs in AWS, but lacks the budget for more instances, and each pool by
> itself is too small for us to come up with a 2 rack organization that makes
> sense”.
>
>
>
> R
>
>
>
> *From: *Osman YOZGATLIOĞLU 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Wednesday, October 23, 2019 at 10:40 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: merge two cluster
>
>
>
> *Message from External Sender*
>
> Sorry, missing question;
>
> Actually I'm asking this for performance perspective. At application level
> both cluster used at the same time and approx same level. Inserted data
> inserted to both cluster, different parts of course.
>
> If I merge two cluster, can I gain some performance improvements? Like
> raid stripes, more disk, more stripe, more speed..
>
>
>
> Regards
>
> On 23.10.2019 17:30, Durity, Sean R wrote:
>
> Beneficial to whom? The apps, the admins, the developers?
>
>
>
> I suggest that app teams have separate clusters per application. This
> prevents the noisy neighbor problem, isolates any security issues, and
> helps when it is time for maintenance, upgrade, performance testing, etc.
> to not have to coordinate multiple app teams at the same time. Also, an
> individual cluster can be tuned for its specific workload. Sometimes,
> though, costs and data size push us towards combining smaller apps owned by
> the same team onto a single cluster. Those are the exceptions.
>
>
>
> As a Cassandra admin, I am always trying to scale the ability to admin
> multiple clusters without just adding new admins. That is an on-going task,
> dependent on your operating environment.
>
>
>
> Also, because every table has a portion of memory (memtable), there is a
> practical limit to the number of tables that any one cluster should have. I
> have heard it is in the low hundreds of tables. This puts a limit on the
> number of applications that a cluster can safely support.
>
>
>
>
>
> Sean Durity – Staff Systems Engineer, Cassandra
>
>
>
> *From:* Osman YOZGATLIOĞLU 
> 
> *Sent:* Wednesday, October 23, 2019 6:23 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] merge two cluster
>
>
>
> Hello,
>
> I have two cluster and both contains different data sets with different
> node counts.
>
> Would it be beneficial to merge two cluster?
>
>
>
> Regards,
>
> Osman
>
>

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Jon Haddad
Oh, my bad.  There was a flood of information there, I didn't realize you
had switched to two DCs.  It's been a long day.

I'll be honest, it's really hard to read your various options as you've
intermixed terminology from AWS and Cassandra in a weird way, and there are
several pages of information here to go through.  I don't have time to
decipher it, sorry.

Spread a DC across 3 AZs if you want to be fault tolerant and will use
RF=3, use a single AZ if you don't care about full DC failure in the case
of an AZ failure or you're not using RF=3.
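
For completeness, the replication side of a two-DC split is just
NetworkTopologyStrategy with a per-DC factor.  The DC names below follow the
ones in your options and are assumptions:

    CREATE KEYSPACE my_ks WITH replication = {
        'class': 'NetworkTopologyStrategy',
        'read':  3,
        'write': 3
    };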


On Wed, Oct 23, 2019 at 4:56 PM Sergio  wrote:

> OPTION C or OPTION A?
>
> Which one are you referring to?
>
> Both have separate DCs to keep the workload separate.
>
>- OPTION A)
>- Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>- 3 read ONE us-east-1a
>- 4 write TWO us-east-1b 5 write TWO us-east-1b
>- 6 write TWO us-east-1b
>
>
> Here we have 2 DC read and write
> One Rack per DC
> One Availability Zone per DC
>
> Thanks,
>
> Sergio
>
>
> On Wed, Oct 23, 2019, 1:11 PM Jon Haddad  wrote:
>
>> Personally, I wouldn't ever do this.  I recommend separate DCs if you
>> want to keep workloads separate.
>>
>> On Wed, Oct 23, 2019 at 4:06 PM Sergio  wrote:
>>
>>>   I forgot to comment for
>>>
>>>OPTION C)
>>>1. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>>>2. 3 read ONE us-east-1c
>>>3. 4 write TWO us-east-1a 5 write TWO us-east-1b
>>>4. 6 write TWO us-east-1c I would expect that I need to decrease the
>>>Consistency Level in the reads if one of the AZ goes down. Please 
>>> consider
>>>the below one as the real OPTION A. The previous one looks to be wrong
>>>because the same rack is assigned to 2 different DC.
>>>5. OPTION A)
>>>6. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>>7. 3 read ONE us-east-1a
>>>8. 4 write TWO us-east-1b 5 write TWO us-east-1b
>>>9. 6 write TWO us-east-1b
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Sergio
>>>
>>> Il giorno mer 23 ott 2019 alle ore 12:33 Sergio <
>>> lapostadiser...@gmail.com> ha scritto:
>>>
>>>> Hi Reid,
>>>>
>>>> Thank you very much for clearing these concepts for me.
>>>> https://community.datastax.com/comments/1133/view.html I posted this
>>>> question on the datastax forum regarding our cluster that it is unbalanced
>>>> and the reply was related that the *number of racks should be a
>>>> multiplier of the replication factor *in order to be balanced or 1. I
>>>> thought then if I have 3 availability zones I should have 3 racks for each
>>>> datacenter and not 2 (us-east-1b, us-east-1a) as I have right now or in the
>>>> easiest way, I should have a rack for each datacenter.
>>>>
>>>>
>>>>
>>>>1. Datacenter: live
>>>>
>>>>Status=Up/Down
>>>>|/ State=Normal/Leaving/Joining/Moving
>>>>--  Address  Load   Tokens   OwnsHost ID
>>>>Rack
>>>>UN  10.1.20.49   289.75 GiB  256  ?
>>>>be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
>>>>UN  10.1.30.112  103.03 GiB  256  ?
>>>>e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
>>>>UN  10.1.19.163  129.61 GiB  256  ?
>>>>3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
>>>>UN  10.1.26.181  145.28 GiB  256  ?
>>>>0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
>>>>UN  10.1.17.213  149.04 GiB  256  ?
>>>>71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
>>>>DN  10.1.19.198  52.41 GiB  256  ?
>>>>613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
>>>>UN  10.1.31.60   195.17 GiB  256  ?
>>>>3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
>>>>UN  10.1.25.206  100.67 GiB  256  ?
>>>>f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b
>>>>So each rack label right now matches the availability zone and we
>>>>have 3 Datacenters and 2 Availability Zone with 2 racks per DC but the
>>>>above is clearly unbalanced
>>>>If I have a keyspace with a replication factor = 3 and I want to
>>>>minimize the number of nodes to scale up and down the cluster and keep 

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Jon Haddad
Personally, I wouldn't ever do this.  I recommend separate DCs if you want
to keep workloads separate.

On Wed, Oct 23, 2019 at 4:06 PM Sergio  wrote:

>   I forgot to comment for
>
>OPTION C)
>1. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>2. 3 read ONE us-east-1c
>3. 4 write TWO us-east-1a 5 write TWO us-east-1b
>4. 6 write TWO us-east-1c I would expect that I need to decrease the
>Consistency Level in the reads if one of the AZ goes down. Please consider
>the below one as the real OPTION A. The previous one looks to be wrong
>because the same rack is assigned to 2 different DC.
>5. OPTION A)
>6. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>7. 3 read ONE us-east-1a
>8. 4 write TWO us-east-1b 5 write TWO us-east-1b
>9. 6 write TWO us-east-1b
>
>
>
> Thanks,
>
> Sergio
>
> Il giorno mer 23 ott 2019 alle ore 12:33 Sergio 
> ha scritto:
>
>> Hi Reid,
>>
>> Thank you very much for clearing these concepts for me.
>> https://community.datastax.com/comments/1133/view.html I posted this
>> question on the datastax forum regarding our cluster that it is unbalanced
>> and the reply was related that the *number of racks should be a
>> multiplier of the replication factor *in order to be balanced or 1. I
>> thought then if I have 3 availability zones I should have 3 racks for each
>> datacenter and not 2 (us-east-1b, us-east-1a) as I have right now or in the
>> easiest way, I should have a rack for each datacenter.
>>
>>
>>
>>1. Datacenter: live
>>
>>Status=Up/Down
>>|/ State=Normal/Leaving/Joining/Moving
>>--  Address  Load   Tokens   OwnsHost ID
>>  Rack
>>UN  10.1.20.49   289.75 GiB  256  ?
>>be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
>>UN  10.1.30.112  103.03 GiB  256  ?
>>e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
>>UN  10.1.19.163  129.61 GiB  256  ?
>>3c2efdda-8dd4-4f08-b991-9aff062a5388  us-east-1a
>>UN  10.1.26.181  145.28 GiB  256  ?
>>0a8f07ba-a129-42b0-b73a-df649bd076ef  us-east-1b
>>UN  10.1.17.213  149.04 GiB  256  ?
>>71563e86-b2ae-4d2c-91c5-49aa08386f67  us-east-1a
>>DN  10.1.19.198  52.41 GiB  256  ?
>>613b43c0-0688-4b86-994c-dc772b6fb8d2  us-east-1b
>>UN  10.1.31.60   195.17 GiB  256  ?
>>3647fcca-688a-4851-ab15-df36819910f4  us-east-1b
>>UN  10.1.25.206  100.67 GiB  256  ?
>>f43532ad-7d2e-4480-a9ce-2529b47f823d  us-east-1b
>>So each rack label right now matches the availability zone and we
>>have 3 Datacenters and 2 Availability Zone with 2 racks per DC but the
>>above is clearly unbalanced
>>If I have a keyspace with a replication factor = 3 and I want to
>>minimize the number of nodes to scale up and down the cluster and keep it
>>balanced should I consider an approach like OPTION A)
>>2. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>3. 3 read ONE us-east-1a
>>4. 4 write ONE us-east-1b 5 write ONE us-east-1b
>>5. 6 write ONE us-east-1b
>>6. OPTION B)
>>7. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1a
>>8. 3 read ONE us-east-1a
>>9. 4 write TWO us-east-1b 5 write TWO us-east-1b
>>10. 6 write TWO us-east-1b
>>11. *7 read ONE us-east-1c 8 write TWO us-east-1c*
>>12. *9 read ONE us-east-1c* Option B looks to be unbalanced and I
>>would exclude it OPTION C)
>>13. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>>14. 3 read ONE us-east-1c
>>15. 4 write TWO us-east-1a 5 write TWO us-east-1b
>>16. 6 write TWO us-east-1c
>>17.
>>
>>
>>so I am thinking of A if I have the restriction of 2 AZ but I guess
>>that option C would be the best. If I have to add another DC for reads
>>because we want to assign a new DC for each new microservice it would look
>>like:
>>   OPTION EXTRA DC For Reads
>>   1. Node DC RACK AZ 1 read ONE us-east-1a 2 read ONE us-east-1b
>>   2. 3 read ONE us-east-1c
>>   3. 4 write TWO us-east-1a 5 write TWO us-east-1b
>>   4. 6 write TWO us-east-1c 7 extra-read THREE us-east-1a
>>   5. 8 extra-read THREE us-east-1b
>>   6.
>>  7.
>>
>>
>>1. 9 extra-read THREE us-east-1c
>>   2.
>>The DC for *write* will replicate the data in the other datacenters.
>>My scope is to keep the *read* machines dedicated to serve reads and
>>*write* machines to serve writes. Cassandra will handle the
>>replication for me. Is there any other option that is I missing or wrong
>>assumption? I am thinking that I will write a blog post about all my
>>learnings so far, thank you very much for the replies Best, Sergio
>>
>>
>> Il giorno mer 23 ott 2019 alle ore 10:57 Reid Pinchback <
>> rpinchb...@tripadvisor.com> ha scritto:
>>
>>> No, that’s not correct.  The point of racks is to help you distribute

Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

2019-10-22 Thread Jon Haddad
CPU time spent stalled waiting on memory still shows up as CPU utilization.
There's a good post on
the topic by Brendan Gregg:
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

Regarding GC, I agree with Reid.  You're probably not going to saturate
your network card no matter what your settings, Cassandra has way too much
overhead to do that.  It's one of the reasons why the whole zero-copy
streaming feature was added to Cassandra 4.0:
http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html

Reid is also correct in pointing out the method by which you're monitoring
your metrics might be problematic.  With prometheus, the same data can show
significantly different graphs when using rate vs irate, and only
collecting once a minute would hide a lot of useful data.

If you keep digging and find you're not using all your CPU during GC
pauses, you can try using more GC threads by setting -XX:ParallelGCThreads
to match the number of cores you have, since by default it won't use them
all.  You've got 40 cores in the m4.10xlarge, try
setting -XX:ParallelGCThreads to 40.
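
On 2.1 that's set in cassandra-env.sh (jvm.options on 3.x and later) and
needs a restart to take effect, e.g.:

    JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=40"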

Jon



On Tue, Oct 22, 2019 at 11:38 AM Reid Pinchback 
wrote:

> Thomas, what is your frequency of metric collection?  If it is
> minute-level granularity, that can give a very false impression.  I’ve seen
> CPU and disk throttles that don’t even begin to show visibility until
> second-level granularity around the time of the constraining event.  Even
> clearer is 100ms.
>
>
>
> Also, are you monitoring your GC activity at all?  GC bound up in a lot of
> memory copies is not going to manifest that much CPU, it’s memory bus
> bandwidth you are fighting against then.  It is easy to have a box that
> looks unused but in reality its struggling.  Given that you’ve opened up
> the floodgates on compaction, that would seem quite plausible to be what
> you are experiencing.
>
>
>
> *From: *"Steinmaurer, Thomas" 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Tuesday, October 22, 2019 at 11:22 AM
> *To: *"user@cassandra.apache.org" 
> *Subject: *RE: Cassandra 2.1.18 - Question on stream/bootstrap throughput
>
>
>
> *Message from External Sender*
>
> Hi Alex,
>
>
>
> Increased streaming throughput has been set on the existing nodes only,
> cause it is meant to limit outgoing traffic only, right? At least when
> judging from the name, reading the documentation etc.
>
>
>
> Increased compaction throughput on all nodes, although my understanding is
> that it would be necessary only on the joining node to catchup with
> compacting received SSTables.
>
>
>
> We really see no resource (CPU, NW and disk) being somehow maxed out on
> any node, which would explain the limit in the area of the new node
> receiving data at ~ 180-200 Mbit/s.
>
>
>
> Thanks again,
>
> Thomas
>
>
>
> *From:* Oleksandr Shulgin 
> *Sent:* Dienstag, 22. Oktober 2019 16:35
> *To:* User 
> *Subject:* Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput
>
>
>
> On Tue, Oct 22, 2019 at 12:47 PM Steinmaurer, Thomas <
> thomas.steinmau...@dynatrace.com> wrote:
>
>
>
> using 2.1.18, 3 nodes (m4.10xlarge, EBS SSD-based), vnodes=256, RF=3, we
> are trying to add a 4th node.
>
>
>
> The two options to my knowledge, mainly affecting throughput, namely
> stream output and compaction throttling has been set to very high values
> (e.g. stream output = 800 Mbit/s resp. compaction throughput = 500 Mbyte/s)
> or even set to 0 (unthrottled) in cassandra.yaml + process restart. In both
> scenarios (throttling with high values vs. unthrottled), the 4th node is
> streaming from one node capped ~ 180-200Mbit/s, according to our SFM.
>
>
>
> The nodes have plenty of resources available (10Gbit, disk io/iops), also
> confirmed by e.g. iperf in regard to NW throughput and write to / read from
> disk in the area of 200 MByte/s.
>
>
>
> Are there any other known throughput / bootstrap limitations, which
> basically outrule above settings?
>
>
>
> Hi Thomas,
>
>
>
> Assuming you have 3 Availability Zones and you are adding the new node to
> one of the zones where you already have a node running, it is expected that
> it only streams from that node (its local rack).
>
>
>
> Have you increased the streaming throughput on the node it streams from or
> only on the new node?  The limit applies to the source node as well.  You
> can change it online w/o the need to restart using nodetool command.
>
>
>
> Have you checked if the new node is not CPU-bound?  It's unlikely though
> due to big instance type and only one node to stream from, more relevant
> for scenarios when streaming from a lot of nodes.
>
>
>
> Cheers,
>
> --
>
> Alex
>
>
>
> The contents of this e-mail are intended for the named addressee only. It
> contains information that may be confidential. Unless you are the named
> addressee or an authorized designee, you may not copy or use it, or
> disclose it to anyone else. If you received it in error please notify us
> immediately and then destroy it. Dynatrace Austria GmbH 
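
For reference, the two throttles discussed in this thread can also be changed
at runtime with nodetool, on the source nodes as well as the joining node; a
sketch with illustrative values (streaming is capped in megabits/s, compaction
in MB/s, and 0 means unthrottled):

    nodetool getstreamthroughput
    nodetool setstreamthroughput 800
    nodetool getcompactionthroughput
    nodetool setcompactionthroughput 500

These settings do not survive a restart; cassandra.yaml still needs to be
updated for them to persist.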

Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Jon Haddad
tlp-stress comes with workloads pre-baked, so there's not much
configuration to do.  The main flags you'll want are going to be:

-d : duration, I highly recommend running your test for a few days
--compaction
--compression
-p: number of partitions
-r: % of reads, 0-1

For example, you might run:

tlp-stress run KeyValue -d 24h --compaction lcs -p 10m -r .9

for a basic key value table, running for 24 hours, using LCS, 10 million
partitions, 90% reads.

There's a lot of options. I won't list them all here, it's why I wrote the
manual :)

Jon


On Mon, Oct 21, 2019 at 1:16 PM Sergio  wrote:

> Thanks, guys!
> I just copied and pasted what I found on our test machines, but I can
> confirm that we have the same settings except for 8GB in production.
> I didn't select these settings and I need to verify why these settings are
> there.
> If any of you want to share your flags for a read-heavy workload it would
> be appreciated, so I would replace and test those flags with TLP-STRESS.
> I am thinking about different approaches (G1GC vs ParNew + CMS)
> How many GB for RAM do you dedicate to the OS in percentage or in an exact
> number?
> Can you share the flags for ParNew + CMS that I can play with it and
> perform a test?
>
> Best,
> Sergio
>
>
> Il giorno lun 21 ott 2019 alle ore 09:27 Reid Pinchback <
> rpinchb...@tripadvisor.com> ha scritto:
>
>> Since the instance size is < 32gb, hopefully swap isn’t being used, so it
>> should be moot.
>>
>>
>>
>> Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably
>> doesn’t do anything for you.  I believe that only applies to CMS, not
>> G1GC.  I also wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good
>> thing on AWS (or anything virtualized), you’d have to run your own tests
>> and find out.
>>
>>
>>
>> R
>>
>> *From: *Jon Haddad 
>> *Reply-To: *"user@cassandra.apache.org" 
>> *Date: *Monday, October 21, 2019 at 12:06 PM
>> *To: *"user@cassandra.apache.org" 
>> *Subject: *Re: [EXTERNAL] Re: GC Tuning
>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>
>>
>>
>> *Message from External Sender*
>>
>> One thing to note, if you're going to use a big heap, cap it at 31GB, not
>> 32.  Once you go to 32GB, you don't get to use compressed pointers [1], so
>> you get less addressable space than at 31GB.
>>
>>
>>
>> [1]
>> https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
>>
>>
>>
>> On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R <
>> sean_r_dur...@homedepot.com> wrote:
>>
>> I don’t disagree with Jon, who has all kinds of performance tuning
>> experience. But for ease of operation, we only use G1GC (on Java 8),
>> because the tuning of ParNew+CMS requires a high degree of knowledge and
>> very repeatable testing harnesses. It isn’t worth our time. As a previous
>> writer mentioned, there is usually better return on our time tuning the
>> schema (aka helping developers understand Cassandra’s strengths).
>>
>>
>>
>> We use 16 – 32 GB heaps, nothing smaller than that.
>>
>>
>>
>> Sean Durity
>>
>>
>>
>> *From:* Jon Haddad 
>> *Sent:* Monday, October 21, 2019 10:43 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* [EXTERNAL] Re: GC Tuning
>> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>
>>
>>
>> I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
>> comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
>> it is, but I like to verify first.  The pause times with ParNew + CMS are
>> generally lower than G1 when tuned right, but as Chris said it can be
>> tricky.  If you aren't willing to spend the time understanding how it works
>> and why each setting matters, G1 is a better option.
>>
>>
>>
>> I wouldn't run Cassandra in production on less than 8GB of heap - I
>> consider it the absol

Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Jon Haddad
One thing to note, if you're going to use a big heap, cap it at 31GB, not
32.  Once you go to 32GB, you don't get to use compressed pointers [1], so
you get less addressable space than at 31GB.

[1]
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/
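
A quick way to sanity-check the compressed-pointer cutoff on a given JDK, as a
sketch (standard HotSpot flags, nothing Cassandra-specific):

    java -Xmx31G -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # true
    java -Xmx32G -XX:+PrintFlagsFinal -version | grep UseCompressedOops   # false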

On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R 
wrote:

> I don’t disagree with Jon, who has all kinds of performance tuning
> experience. But for ease of operation, we only use G1GC (on Java 8),
> because the tuning of ParNew+CMS requires a high degree of knowledge and
> very repeatable testing harnesses. It isn’t worth our time. As a previous
> writer mentioned, there is usually better return on our time tuning the
> schema (aka helping developers understand Cassandra’s strengths).
>
>
>
> We use 16 – 32 GB heaps, nothing smaller than that.
>
>
>
> Sean Durity
>
>
>
> *From:* Jon Haddad 
> *Sent:* Monday, October 21, 2019 10:43 AM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] Re: GC Tuning
> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>
>
> I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
> comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
> it is, but I like to verify first.  The pause times with ParNew + CMS are
> generally lower than G1 when tuned right, but as Chris said it can be
> tricky.  If you aren't willing to spend the time understanding how it works
> and why each setting matters, G1 is a better option.
>
>
>
> I wouldn't run Cassandra in production on less than 8GB of heap - I
> consider it the absolute minimum.  For G1 I'd use 16GB, and never 4GB with
> Cassandra unless you're rarely querying it.
>
>
>
> I typically use the following as a starting point now:
>
>
>
> ParNew + CMS
>
> 16GB heap
>
> 10GB new gen
>
> 2GB memtable cap, otherwise you'll spend a bunch of time copying around
> memtables (cassandra.yaml)
>
> Max tenuring threshold: 2
>
> survivor ratio 6
>
>
>
> I've also done some tests with a 30GB heap, 24 GB of which was new gen.
> This worked surprisingly well in my tests since it essentially keeps
> everything out of the old gen.  New gen allocations are just a pointer bump
> and are pretty fast, so in my (limited) tests of this I was seeing really
> good p99 times.  I was seeing a 200-400 ms pause roughly once a minute
> running a workload that deliberately wasn't hitting a resource limit
> (testing real world looking stress vs overwhelming the cluster).
>
>
>
> We built tlp-cluster [1] and tlp-stress [2] to help figure these things
> out.
>
>
>
> [1] https://thelastpickle.com/tlp-cluster/
>
> [2] http://thelastpickle.com/tlp-stress
>
>
>
> Jon
>
>
>
>
>
>
>
>
>
> On Mon, Oct 21, 2019 at 10:24 AM Reid Pinchback <
> rpinchb...@tripadvisor.com> wrote:
>
> An i3.xlarge has 30.5 GB of RAM but you’re using less than 4 GB for C*.  So
> minus room for other uses of jvm memory and for kernel activity, that’s
> about 25 gb for file cache.  You’ll have to see if you either want a bigger
> heap to allow for less frequent gc cycles, or you could save money on the
> instance size.  C* generates a lot of medium-length lifetime objects which
> can easily end up in old gen.  A larger heap will reduce the burn of more
> old-gen collections.  There are no magic numbers to just give because it’ll
> depend on your usage patterns.
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Sunday, October 20, 2019 at 2:51 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: GC Tuning 
> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>
>
> *Message from External Sender*
>
> Thanks for the answer.
>
> This is the JVM version that I have right now.
>
> openjdk version "1.8.0_161"
> OpenJDK Runtime Environment (build 1.8.0_161-b14)
> OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>
> These are the current flags. Would you change anything in a i3x.large aws
> node?
>
> java -Xloggc:/var/log/cassandra/gc.log
> -Dcas

Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Jon Haddad
I still use ParNew + CMS over G1GC with Java 8.  I haven't done a
comparison with JDK 11 yet, so I'm not sure if it's any better.  I've heard
it is, but I like to verify first.  The pause times with ParNew + CMS are
generally lower than G1 when tuned right, but as Chris said it can be
tricky.  If you aren't willing to spend the time understanding how it works
and why each setting matters, G1 is a better option.

I wouldn't run Cassandra in production on less than 8GB of heap - I
consider it the absolute minimum.  For G1 I'd use 16GB, and never 4GB with
Cassandra unless you're rarely querying it.

I typically use the following as a starting point now:

ParNew + CMS
16GB heap
10GB new gen
2GB memtable cap, otherwise you'll spend a bunch of time copying around
memtables (cassandra.yaml)
Max tenuring threshold: 2
survivor ratio 6
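
A rough translation of that starting point into concrete settings, as a sketch
(standard HotSpot flag names; the memtable cap lives in cassandra.yaml, and the
exact files differ between Cassandra packaging versions):

    # JVM options (cassandra-env.sh or jvm.options)
    -Xms16G -Xmx16G -Xmn10G
    -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
    -XX:MaxTenuringThreshold=2
    -XX:SurvivorRatio=6

    # cassandra.yaml
    memtable_heap_space_in_mb: 2048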

I've also done some tests with a 30GB heap, 24 GB of which was new gen.
This worked surprisingly well in my tests since it essentially keeps
everything out of the old gen.  New gen allocations are just a pointer bump
and are pretty fast, so in my (limited) tests of this I was seeing really
good p99 times.  I was seeing a 200-400 ms pause roughly once a minute
running a workload that deliberately wasn't hitting a resource limit
(testing real world looking stress vs overwhelming the cluster).

We built tlp-cluster [1] and tlp-stress [2] to help figure these things
out.

[1] https://thelastpickle.com/tlp-cluster/
[2] http://thelastpickle.com/tlp-stress

Jon




On Mon, Oct 21, 2019 at 10:24 AM Reid Pinchback 
wrote:

> An i3.xlarge has 30.5 GB of RAM but you’re using less than 4 GB for C*.  So
> minus room for other uses of jvm memory and for kernel activity, that’s
> about 25 gb for file cache.  You’ll have to see if you either want a bigger
> heap to allow for less frequent gc cycles, or you could save money on the
> instance size.  C* generates a lot of medium-length lifetime objects which
> can easily end up in old gen.  A larger heap will reduce the burn of more
> old-gen collections.  There are no magic numbers to just give because it’ll
> depend on your usage patterns.
>
>
>
> *From: *Sergio 
> *Reply-To: *"user@cassandra.apache.org" 
> *Date: *Sunday, October 20, 2019 at 2:51 PM
> *To: *"user@cassandra.apache.org" 
> *Subject: *Re: GC Tuning
> https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>
>
> *Message from External Sender*
>
> Thanks for the answer.
>
> This is the JVM version that I have right now.
>
> openjdk version "1.8.0_161"
> OpenJDK Runtime Environment (build 1.8.0_161-b14)
> OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)
>
> These are the current flags. Would you change anything in a i3x.large aws
> node?
>
> java -Xloggc:/var/log/cassandra/gc.log
> -Dcassandra.max_queued_native_transport_requests=4096 -ea
> -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
> -XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103
> -XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB
> -XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true
> -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:+UseG1GC
> -XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200
> -XX:InitiatingHeapOccupancyPercent=45 -XX:G1HeapRegionSize=0
> -XX:-ParallelRefProcEnabled -Xms3821M -Xmx3821M
> -XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler
> -Dcom.sun.management.jmxremote.port=7199
> -Dcom.sun.management.jmxremote.rmi.port=7199
> -Dcom.sun.management.jmxremote.ssl=false
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.password.file=/etc/cassandra/conf/jmxremote.password
> -Dcom.sun.management.jmxremote.access.file=/etc/cassandra/conf/jmxremote.access
> -Djava.library.path=/usr/share/cassandra/lib/sigar-bin
> -Djava.rmi.server.hostname=172.24.150.141 -XX:+CMSClassUnloadingEnabled
> -javaagent:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar=10100:/etc/cassandra/default.conf/jmx-export.yml
> -Dlogback.configurationFile=logback.xml
> -Dcassandra.logdir=/var/log/cassandra -Dcassandra.storagedir=
> -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid
> -Dcassandra-foreground=yes -cp
> 

Re: Elevated response times from all nodes in a data center at the same time.

2019-10-16 Thread Jon Haddad
It's possible the queries you're normally running are served out of page
cache, and during the latency spike you're hitting your disks. If you're
using read ahead you might be hitting a throughput limit on the disks.

I've got some numbers and graphs I can share later when I'm not on my
phone.

Jon
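
If read ahead is a suspect, it can be inspected and lowered for a test, as a
sketch (device name is only an example, the value is in 512-byte sectors, and
the change does not survive a reboot):

    blockdev --getra /dev/nvme0n1      # current read-ahead, in 512-byte sectors
    blockdev --setra 8 /dev/nvme0n1    # drop it to 4 KB while testing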

On Wed, Oct 16, 2019, 3:03 AM Alain RODRIGUEZ  wrote:

> Hello Bill,
>
> I think it might be worth it to focus the effort on diagnosing the issue
> properly in the first place, thus I'll try to guide you through this.
>
> First some comments on your environment:
>
> AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in
>> each region.
>> No of Nodes: 24
>> Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for
>> APIs and 2 OLAP Data centers for Analytics and Batch loads)
>> Instance Types: r5.8x Large
>> Average Node Size: 182 GB
>> Work Load: Read heavy
>>
>
> When I read this, I think 'Tune the garbage collection properly!'. do you
> see GC pauses being a problem? The easiest way to interpret GC logs is
> probably to upload them there: https://gceasy.io. Check there that the
> 'GC throughput' is at least around 95% (ideally 98+%). This would mean
> that each node spends no more than about 2 to 5% of its time in stop-the-world GCs. If
> that's a thing, we can help you setting GC options a bit nicer than what it
> is currently probably. That post would then probably be a good starting
> point: https://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>
>
>> Read TPS: 22k
>> Cassandra version: 3.0.15
>>
>
> Reading this, I'd recommend an upgrade to the 3.0.latest (3.0.18 at the
> time being)  or (personal preference) 3.11.4. There was a bunch of fixes,
> maybe are you hitting something that was fixed already, check changes
> there, see if any change could do some good for your use case:
> https://github.com/apache/cassandra/blob/cassandra-3.0.18/CHANGES.txt#L1-L124
>
> Java Version: JDK 181.
>> EBS Volumes: GP2 with 1TB 3000 iops.
>>
>
> The GP2 IOPS depend on the disk size. If you find out at any time that
> disks are not keeping up, a good way out could be to increase the disk size
> (despite the small dataset) to actually increase the disk IOPS &
> throughput. Now this did not change recently, and it was working for you
> until now, so you probably don't have to increase the disk size now. Just
> be aware that GP2 volumes at 1 TB are quite slow.
>
> About the issue:
>
> our p99 latency in our AWS us-east-1 region OLTP data center, suddenly
>> starts rising from 2 ms to 200 ms. It starts with one node where we see the
>> 99th percentile Read Request latency in Datastax Opscenter starts
>> increasing. And it spreads immediately, to all other 6 nodes in the data
>> center.
>>
>
> Well, this sounds quite bad. The first 2 things coming to my mind here are:
> - Are you reading tombstones? (check logs for tombstones, trace a few
> queries)
> - Are you reading a huge partition? (check the max partition size, compare
> it to the mean and ensure it is remaining below 1 GB (or even 100 MB
> ideally).
>
> An inefficient read, for the reasons above or other reasons, would not
> necessarily impact nodes' resources but could definitely destroy
> performances for this query and the following ones due to the 'requests
> congestion'.
>
> To try to make a sense of the current tombstones level you can look at:
> - logs (grep tombstone)
> - sstablemetadata gives you the % of droppable tombstones. This is an
> estimate and of the space that could be freed, it gives no information on
> whether tombstones are being read and can affect performances or not, yet
> it gives you an idea of the tombstones that can be generated in the workflow
> - Trace queries: either trace a manual query from cqlsh with 'TRACING ON;'
> then sending queries similar to prod ones. Or directly using 'nodetool
> settraceprobability X', /!\ ensure X is really low to start with - like
> 0.001 or 0.0001 maybe, we probably don't need many queries to understand
> what happened, and tracing might inflict a big penalty on the Cassandra servers
> in terms of performance (each of the traced queries will induce a bunch of
> queries to actually persist the trace in the system_traces keyspace).
>
> We do not see any Read request timeouts or Exception in the our API Splunk
>> logs only p99 and average latency go up suddenly.
>>
>
> What's the value you use for timeouts? Also, any other exception/timeout,
> somewhere else than for reads?
> What are the result of:
>
> - nodetool tablestats (I think this would gather what you need to check
> --> nodetool tablestats | grep -e Keyspace -e Table: -e latency -e
> partition -e tombstones)
> - watch -d nodetool tpstats (here look at any pending threads constantly
> higher than 0, any blocked or dropped threads)
>
> We have investigated CPU level usage, Disk I/O, Memory usage and Network
>> parameters for the nodes during this period and we are not experiencing any
>> sudden surge in these parameters.
>
>
> If the 

Re: cluster rolling restart

2019-10-16 Thread Jon Haddad
I agree with Jeff here. Ideally you should be so comfortable with rolling
restarts that they become second nature. Cassandra is designed to handle
them and you should not be afraid to do them regularly.
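
A minimal per-node sequence for such a rolling restart, as a sketch (service
name and exact checks vary by install; wait for the node to show UN in
nodetool status before moving on to the next one):

    nodetool drain                       # flush memtables, stop accepting traffic
    sudo systemctl restart cassandra
    nodetool status                      # proceed once the node is back to UN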

On Wed, Oct 16, 2019, 8:06 AM Jeff Jirsa  wrote:

>
> Personally I encourage you to rolling restart from time to time, use it as
> an opportunity to upgrade kernels and JDKs and cassandra itself and just
> generally make sure things are healthy and working how you expect
>
> If you see latencies jump or timeouts when you’re bouncing, that’s a
> warning and you know you need to address it - doing this in advance gives
> you a chance to do it while the bounce is optional and can be paused. If
> you wait for a switch to fail or AWS AZ to crash, you may have problems
> lurking you don’t know about until it’s too late.
>
> - Jeff
>
> > On Oct 16, 2019, at 12:56 AM, Marco Gasparini <
> marco.gaspar...@competitoor.com> wrote:
> >
> > 
> > hi all,
> >
> > I was wondering if it is recommended to perform a rolling restart of the
> cluster once in a while.
> > Is it a good practice or necessary? how often?
> >
> > Thanks
> > Marco
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Update/where statement Adds Row

2019-09-12 Thread Jon Haddad
Probably not a great idea unless you're using it sparingly. Using LWTs
without knowing all the caveats is likely to lead to terrible cluster
performance.




On Wed, Sep 11, 2019, 10:59 PM A  wrote:

> Is it ok if I do this?
>
> ... where email = em AND company_id = id IF EXISTS
>
>
>
>
>
> Sent from Yahoo Mail for iPhone
> 
>
> On Wednesday, September 11, 2019, 9:08 PM, JOHN, BIBIN 
> wrote:
>
> Use if exists clause.
>
>
>
> *UPDATE* table
>
> *SET* column ='something'
>
> WHERE key = ‘value’ IF EXISTS;
>
>
>
>
>
>
>
>
>
>
>
> *From:* A 
> *Sent:* Wednesday, September 11, 2019 11:05 PM
> *To:* User cassandra.apache.org 
> *Subject:* Update/where statement Adds Row
>
>
>
> I have an update statement that has a where clause with the primary key
> (email,companyid).
>
>
>
> When executed it always creates a new row. It’s like it’s not finding the
> existing row with the primary key.
>
>
>
> I’m using Cassandra-driver.
>
>
>
> What am I doing wrong? I don’t want a new row. Why doesn’t it seem to be
> using the where clause to identify the existing row?
>
>
>
> Thanks,
>
> Angel
>
>
>
>
> Sent from Yahoo Mail for iPhone
> 
>
>


Re: Is it possible to build multi cloud cluster for Cassandra

2019-09-05 Thread Jon Haddad
Technically, not a problem.  Use GossipingPropertyFileSnitch to keep things
simple and you can go across whatever cloud providers you want without
issue.

The biggest issue you're going to have isn't going to be Cassandra, it's
having the expertise in the different cloud providers to understand their
strengths and weaknesses.  You'll want to benchmark every resource, and
properly sizing your instances to C* is now 2x (or 3x for 3 cloud
providers) the work.

I recommend using Terraform to make provisioning a bit easier.
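
A sketch of the per-node configuration for that kind of setup (DC and rack
names are made up; each node advertises its own values, and the keyspaces then
use NetworkTopologyStrategy with a replica count per DC):

    # cassandra.yaml, on every node
    endpoint_snitch: GossipingPropertyFileSnitch

    # cassandra-rackdc.properties on an AWS node
    dc=aws-us-east-1
    rack=rack1

    # cassandra-rackdc.properties on an Azure node
    dc=azure-eastus
    rack=rack1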

On Thu, Sep 5, 2019 at 9:36 AM Goutham reddy 
wrote:

> Hello,
> Is it wise and advisable to build multi cloud environment for Cassandra
> for High Availability.
> AWS as one datacenter and Azure as another datacenter.
> If yes are there any challenges involved?
>
> Thanks and regards,
> Goutham.
>


Re: New column

2019-08-22 Thread Jon Haddad
Just to close the loop on this, I did a release of tlp-stress last night,
which now has this workload (AllowFiltering).  You can grab a deb, rpm,
tarball or docker image.

Docs are here: http://thelastpickle.com/tlp-stress/

Jon

On Mon, Aug 19, 2019 at 2:21 PM Jon Haddad  wrote:

> It'll be about the same overhead as selecting the entire partition, since
> that's essentially what you're doing.
>
> I created a tlp-stress workload this morning but haven't merged it into
> master yet.  I need to do a little cleanup and I might tweak it a little,
> but if you're feeling adventurous you can build the branch yourself:
> https://github.com/thelastpickle/tlp-stress/tree/jon/106-allow-filtering-workload
>
> Once you do an in place build (./gradlew shadowJar), you'll probably want
> to do something like the following:
>
> bin/tlp-stress run AllowFiltering -p 1k -d 1h -r .5 --populate 1m
> --field.allow_filtering.payload='random(100,200)' --compaction lcs
>
> That's running against C* on my laptop.  Here's what all those arguments
> do:
>
> -p 1k # 1000 partitions
> -d 1h # run for 1 hour (-d = duration)
> -r .5  # (50% reads)
> --populate 1m # (pre populate with 1 million rows)
> --field.allow_filtering.payload='random(100,200)'  # use 100 - 200 bytes
> for the payload.  I assume there will be other data other than just the
> record, this will let you size each row accordingly
> --compaction lcs # use leveled compaction
>
> You can tweak the params as needed.  If you've got a cluster up, use the
> --host to point to it.If you don't have a cluster up, you can spin one
> up in AWS in about 5-10 minutes using our tools:
> https://thelastpickle.com/tlp-cluster/
>
> Happy testing!
> Jon
>
>
> On Mon, Aug 19, 2019 at 1:23 PM Rahul Reddy 
> wrote:
>
>> Jon,
>>
>> If we expect none of our partition keys to have more than 100 records and we
>> pass the partition key in the where clause, we wouldn't see issues using the new column
>> and allow filtering?  Can you please point me to any doc on how allow
>> filtering works. I was under the assumption that it goes through all the partitions
>>
>>
>> On Sun, Aug 18, 2019, 4:33 PM Jon Haddad  wrote:
>>
>>> If you're giving the partition key you won't scan the whole table. The
>>> overhead will depend on the size or the partition.
>>>
>>> Would be an interesting workload for our tlp-stress tool, I'll code
>>> something up for the next release.
>>>
>>> On Sun, Aug 18, 2019, 12:58 PM Rahul Reddy 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> We have a table and want to add column and select based on existing
>>>> entire primary key plus new column using allow filtering. Since my where
>>>> clause has all the primary key + new column does the allow filtering scan
>>>> only the partions which are listed or does it has to scan whole table? What
>>>> is the best approach add new column and query it based on existing primary
>>>> key plus new column?
>>>>
>>>


Re: Disk space utilization by from some Cassandra

2019-08-21 Thread Jon Haddad
This advice hasn't been valid for a long time now for most use cases.  The
only time you need to reserve 50% disk space is if you're going to be
running major compactions against a table in your cluster that occupies 50%
of its total disk space.  Nowadays, that's far less common than it was when
your only option was STCS and majors to clean out your tombstones and
tombstoned data.

Even if you need to run majors regularly, if the tables you're running them
on are using 10% of your disk space you only need a little over 10% free to
run it.

If you're using TWCS and TTLs, you can run 80-90% disk usage and be fine in
a lot of cases.

The blanket advice of 50% free needs to die in a fire, it's flat out wrong
*as a rule* and expensive.  There are cases in which it is valid, but there
are far fewer valid applications of it than not.

Jon
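
To answer the quoted question of where the space went, a few quick checks, as
a sketch (paths assume the default data directory; forgotten snapshots are a
common and easy-to-miss consumer of disk):

    du -sh /var/lib/cassandra/data/*/* | sort -h | tail -20   # largest tables on disk
    nodetool listsnapshots                                    # snapshots count against the same disk
    nodetool clearsnapshot -t <snapshot_name>                 # reclaim space from an unneeded snapshot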



On Wed, Aug 21, 2019 at 6:53 AM Stefan Miklosovic <
stefan.mikloso...@instaclustr.com> wrote:

> Hi,
>
> for example, compaction uses a lot of disk space. It is quite common, so
> it is not safe to have your disk utilised at, say, 85%, because
> compactions would not have room to compact and that node would be
> stuck. This happens in production quite often.
>
> Hence, keeping it at 50% and having a big buffer to do compaction is a good
> idea. If it is all compacted, it should go back to normal under 50%
> (or whatever figure you have).
>
> On Wed, 21 Aug 2019 at 14:33,  wrote:
> >
> > Good day,
> >
> >
> >
> > I’m running a monitoring script for disk space utilization with the
> > threshold set to 50%. Currently I am getting alerts from some of the nodes
> > about disk space greater than 50%.
> >
> >
> >
> > Is there a way I can quickly figure out why the space has increased, and
> > how I can keep the disk space used by Cassandra below the
> > threshold at all times?
> >
> >
> >
> > Any ideas would be much appreciated.
> >
> >
> >
> > Sent from Mail for Windows 10
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: New column

2019-08-19 Thread Jon Haddad
It'll be about the same overhead as selecting the entire partition, since
that's essentially what you're doing.

I created a tlp-stress workload this morning but haven't merged it into
master yet.  I need to do a little cleanup and I might tweak it a little,
but if you're feeling adventurous you can build the branch yourself:
https://github.com/thelastpickle/tlp-stress/tree/jon/106-allow-filtering-workload

Once you do an in place build (./gradlew shadowJar), you'll probably want
to do something like the following:

bin/tlp-stress run AllowFiltering -p 1k -d 1h -r .5 --populate 1m
--field.allow_filtering.payload='random(100,200)' --compaction lcs

That's running against C* on my laptop.  Here's what all those arguments do:

-p 1k # 1000 partitions
-d 1h # run for 1 hour (-d = duration)
-r .5  # (50% reads)
--populate 1m # (pre populate with 1 million rows)
--field.allow_filtering.payload='random(100,200)'  # use 100 - 200 bytes
for the payload.  I assume there will be other data other than just the
record, this will let you size each row accordingly
--compaction lcs # use leveled compaction

You can tweak the params as needed.  If you've got a cluster up, use the
--host to point to it.If you don't have a cluster up, you can spin one
up in AWS in about 5-10 minutes using our tools:
https://thelastpickle.com/tlp-cluster/

Happy testing!
Jon


On Mon, Aug 19, 2019 at 1:23 PM Rahul Reddy 
wrote:

> Jon,
>
> If we expect none of our partition keys to have more than 100 records and we
> pass the partition key in the where clause, we wouldn't see issues using the new column
> and allow filtering?  Can you please point me to any doc on how allow
> filtering works. I was under the assumption that it goes through all the partitions
>
>
> On Sun, Aug 18, 2019, 4:33 PM Jon Haddad  wrote:
>
>> If you're giving the partition key you won't scan the whole table. The
>> overhead will depend on the size or the partition.
>>
>> Would be an interesting workload for our tlp-stress tool, I'll code
>> something up for the next release.
>>
>> On Sun, Aug 18, 2019, 12:58 PM Rahul Reddy 
>> wrote:
>>
>>> Hello,
>>>
>>> We have a table and want to add column and select based on existing
>>> entire primary key plus new column using allow filtering. Since my where
>>> clause has all the primary key + new column does the allow filtering scan
>>> only the partions which are listed or does it has to scan whole table? What
>>> is the best approach add new column and query it based on existing primary
>>> key plus new column?
>>>
>>


Re: New column

2019-08-18 Thread Jon Haddad
If you're giving the partition key you won't scan the whole table. The
overhead will depend on the size or the partition.

Would be an interesting workload for our tlp-stress tool, I'll code
something up for the next release.
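
A sketch of the pattern being discussed, with a made-up table: because the
partition key is restricted in the WHERE clause, the filtering only has to
scan the rows of that one partition rather than the whole table.

    ALTER TABLE ks.events ADD category text;

    SELECT * FROM ks.events
    WHERE pk = 'some-partition' AND category = 'foo'
    ALLOW FILTERING;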

On Sun, Aug 18, 2019, 12:58 PM Rahul Reddy  wrote:

> Hello,
>
> We have a table and want to add column and select based on existing entire
> primary key plus new column using allow filtering. Since my where clause
> has all the primary key + new column does the allow filtering scan only the
> partions which are listed or does it has to scan whole table? What is the
> best approach add new column and query it based on existing primary key
> plus new column?
>


Re: Datafile Corruption

2019-08-08 Thread Jon Haddad
Any chance you're using NVMe with an older Linux kernel?  I've seen a *lot*
filesystem errors from using older CentOS versions.  You'll want to be
using a version > 4.15.
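
A quick way to check both points, as a sketch (device and filesystem names
will differ):

    uname -r                                        # kernel version, ideally > 4.15 as noted above
    dmesg -T | grep -iE 'nvme|i/o error|ext4|xfs'   # recent block or filesystem errors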

On Thu, Aug 8, 2019 at 9:31 AM Philip Ó Condúin 
wrote:

> *@Jeff *- If it was hardware that would explain it all, but do you think
> it's possible to have every server in the cluster with a hardware issue?
> The data is sensitive and the customer would lose their mind if I sent it
> off-site which is a pity cause I could really do with the help.
> The corruption is occurring irregularly on every server and instance and
> column family in the cluster.  Out of 72 instances, we are getting maybe 10
> corrupt files per day.
> We are using vnodes (256) and it is happening in both DC's
>
> *@Asad *- internode compression is set to ALL on every server.  I have
> checked the packets for the private interconnect and I can't see any
> dropped packets, there are dropped packets for other interfaces, but not
> for the private ones, I will get the network team to double-check this.
> The corruption is only on the application schema, we are not getting
> corruption on any system or cass keyspaces.  Corruption is happening in
> both DC's.  We are getting corruption for the 1 application schema we have
> across all tables in the keyspace, it's not limited to one table.
> Im not sure why the app team decided to not use default compression, I
> must ask them.
>
>
>
> I have been checking the /var/log/messages today going back a few weeks
> and can see a serious amount of broken pipe errors across all servers and
> instances.
> Here is a snippet from one server but most pipe errors are similar:
>
> Jul  9 03:00:08  cassandra: INFO  02:00:08 Writing
> Memtable-sstable_activity@1126262628(43.631KiB serialized bytes, 18072
> ops, 0%/0% of on/off-heap limit)
> Jul  9 03:00:13  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
> Jul  9 03:00:19  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
> Jul  9 03:00:22  cassandra: ERROR 02:00:22 Got an IOException during write!
> Jul  9 03:00:22  cassandra: java.io.IOException: Broken pipe
> Jul  9 03:00:22  cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native
> Method) ~[na:1.8.0_172]
> Jul  9 03:00:22  cassandra: at
> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
> Jul  9 03:00:22  cassandra: at
> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
> Jul  9 03:00:22  cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> ~[na:1.8.0_172]
> Jul  9 03:00:22  cassandra: at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> ~[na:1.8.0_172]
> Jul  9 03:00:22  cassandra: at
> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
> ~[libthrift-0.9.2.jar:0.9.2]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
> ~[thrift-server-0.3.7.jar:na]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.util.mem.FastMemoryOutputTransport.streamTo(FastMemoryOutputTransport.java:112)
> ~[thrift-server-0.3.7.jar:na]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.Message.write(Message.java:222)
> ~[thrift-server-0.3.7.jar:na]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.handleWrite(TDisruptorServer.java:598)
> [thrift-server-0.3.7.jar:na]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.TDisruptorServer$SelectorThread.processKey(TDisruptorServer.java:569)
> [thrift-server-0.3.7.jar:na]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.select(TDisruptorServer.java:423)
> [thrift-server-0.3.7.jar:na]
> Jul  9 03:00:22  cassandra: at
> com.thinkaurelius.thrift.TDisruptorServer$AbstractSelectorThread.run(TDisruptorServer.java:383)
> [thrift-server-0.3.7.jar:na]
> Jul  9 03:00:25  kernel: fnic_handle_fip_timer: 8 callbacks suppressed
> Jul  9 03:00:30  cassandra: ERROR 02:00:30 Got an IOException during write!
> Jul  9 03:00:30  cassandra: java.io.IOException: Broken pipe
> Jul  9 03:00:30  cassandra: at sun.nio.ch.FileDispatcherImpl.write0(Native
> Method) ~[na:1.8.0_172]
> Jul  9 03:00:30  cassandra: at
> sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_172]
> Jul  9 03:00:30  cassandra: at
> sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_172]
> Jul  9 03:00:30  cassandra: at sun.nio.ch.IOUtil.write(IOUtil.java:65)
> ~[na:1.8.0_172]
> Jul  9 03:00:30  cassandra: at
> sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471)
> ~[na:1.8.0_172]
> Jul  9 03:00:30  cassandra: at
> org.apache.thrift.transport.TNonblockingSocket.write(TNonblockingSocket.java:165)
> ~[libthrift-0.9.2.jar:0.9.2]
> Jul  9 03:00:30  cassandra: at
> com.thinkaurelius.thrift.util.mem.Buffer.writeTo(Buffer.java:104)
> ~[thrift-server-0.3.7.jar:na]
> Jul  9 03:00:30  cassandra: at
> 

Re: Cassandra read requests not getting timeout

2019-08-05 Thread Jon Haddad
I think this might be because the timeout only applied to each request, and
the driver is paginating in the background. Each page is a new request.
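
A sketch of the mechanics: a large result set is fetched page by page, and each
page is a separate request that finishes well inside the server-side
read_request_timeout_in_ms, so nothing ever times out. In cqlsh the page size
is controlled with the PAGING command and the client-side timeout (in seconds)
with --request-timeout, for example (values illustrative, query elided):

    cqlsh 10.50.11.11 --request-timeout=60 -e "CONSISTENCY QUORUM; PAGING 1000; SELECT ... FROM cdvr.jobs ... ALLOW FILTERING;"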

On Mon, Aug 5, 2019, 12:08 AM Oleksandr Shulgin <
oleksandr.shul...@zalando.de> wrote:

> On Mon, Aug 5, 2019 at 8:50 AM nokia ceph 
> wrote:
>
>> Hi Community,
>>
>> I am using Cassandra 3.0.13, a 5 node cluster with simple topology. Following
>> are the timeout  parameters in yaml file:
>>
>> # grep timeout /etc/cassandra/conf/cassandra.yaml
>> cas_contention_timeout_in_ms: 1000
>> counter_write_request_timeout_in_ms: 5000
>> cross_node_timeout: false
>> range_request_timeout_in_ms: 1
>> read_request_timeout_in_ms: 1
>> request_timeout_in_ms: 1
>> truncate_request_timeout_in_ms: 6
>> write_request_timeout_in_ms: 2000
>>
>> I'm running a Cassandra query using cqlsh and it is not timing out.
>>
>> #time cqlsh 10.50.11.11 -e "CONSISTENCY QUORUM; select
>> asset_name,profile_name,job_index,active,last_valid_op,last_valid_op_ts,status,status_description,live_depth,asset_type,dest_path,source_docroot_name,source_asset_name,start_time,end_time,iptv,drm,geo,last_gc
>> from cdvr.jobs where model_type ='asset' AND docroot_name='vx030'
>>  LIMIT 10 ALLOW FILTERING;"
>> Consistency level set to QUORUM.
>> ()
>> ()
>> (79024 rows)
>>
>> real    16m30.488s
>> user    0m39.761s
>> sys     0m3.896s
>>
>> The query took 16.5 minutes to display the output. But my
>> read_request_timeout is 10 seconds. Why didn't the query time out after
>> 10 s?
>>
>
> Hi Renoy,
>
> Have you tried the same query with enabling TRACING beforehand?
>
> https://docs.datastax.com/en/archived/cql/3.3/cql/cql_reference/cqlshTracing.html
>
> It doesn't sound all too likely that it has taken the client 16 minutes to
> display the resultset, but this is definitely not included in the request
> timeout from the server point of view.
>
> Cheers,
> --
> Alex
>
>


Re: Cheat Sheet for Unix based OS, Performance troubleshooting

2019-07-28 Thread Jon Haddad
http://www.brendangregg.com/linuxperf.html

On Sat, Jul 27, 2019 at 2:45 AM Paul Chandler  wrote:

> I have always found Amy's Cassandra 2.1 tuning guide great for the Linux
> performance tuning:
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>
> Sent from my iPhone
>
> On 26 Jul 2019, at 23:49, Krish Donald  wrote:
>
> Any one has  Cheat Sheet for Unix based OS, Performance troubleshooting ?
>
>


Re: Performance impact with ALLOW FILTERING clause.

2019-07-25 Thread Jon Haddad
If you're thinking about rewriting your data to be more performant when
doing analytics, you might as well go the distance and put it in an
analytics friendly format like Parquet.  My 2 cents.

On Thu, Jul 25, 2019 at 11:01 AM ZAIDI, ASAD A  wrote:

> Thank you all for your insights.
>
>
>
> When the spark-connector adds ALLOW FILTERING to a query, it makes the query
> just 'run', no matter whether it is expensive for a larger table or not so
> expensive for a table with fewer rows.
>
> In my particular case, nodes are reaching a 2TB per-node load in a 50 node
> cluster. When a bunch of such queries run, it causes an impact on server
> resources.
>
>
>
> Since ALLOW FILTERING is an expensive operation, I'm trying to find knobs
> which, if I turn them, mitigate the impact.
>
>
>
> What I think, correct me if I am wrong, is that it is the query design itself
> which is not optimized for the table design, and that in turn causes the connector
> to add ALLOW FILTERING implicitly.  I'm not planning to add secondary
> indexes on the tables because they have their own overheads.  Kindly share if
> there are other means we can use to influence the connector not to use
> ALLOW FILTERING.
>
>
>
> Thanks again.
>
> Asad
>
>
>
>
>
>
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com]
> *Sent:* Thursday, July 25, 2019 10:24 AM
> *To:* cassandra 
> *Subject:* Re: Performance impact with ALLOW FILTERING clause.
>
>
>
> "unpredictable" is such a loaded word. It's quite predictable, but it's
> often mispredicted by users.
>
>
>
> "ALLOW FILTERING" basically tells the database you're going to do a query
> that will require scanning a bunch of data to return some subset of it, and
> you're not able to provide a WHERE clause that's sufficiently fine grained
> to avoid the scan. It's a loose equivalent of doing a full table scan in
> SQL databases - sometimes it's a valid use case, but it's expensive, you're
> ignoring all of the indexes, and you're going to do a lot more work.
>
>
>
> It's predictable, though - you're probably going to walk over some range
> of data. Spark is grabbing all of the data to load into RDDs, and it
> probably does it by slicing up the range, doing a bunch of range scans.
>
>
>
> It's doing that so it can get ALL of the data and do the filtering /
> joining / searching in-memory in spark, rather than relying on cassandra to
> do the scanning/searching on disk.
>
>
>
> On Thu, Jul 25, 2019 at 6:49 AM ZAIDI, ASAD A  wrote:
>
> Hello Folks,
>
>
>
> I was going thru documentation and saw at many places saying ALLOW
> FILTERING causes performance unpredictability.  Our developers says ALLOW
> FILTERING clause is implicitly added on bunch of queries by spark-Cassandra
>  connector and they cannot control it; however at the same time we see
> unpredictability in application performance – just as documentation says.
>
>
>
> I’m trying to understand why would a connector add a clause in query when
> this can cause negative impact on database/application performance. Is that
> data model that is driving connector make its decision and add allow
> filtering to query automatically or if there are other reason this clause
> is added to the code. I’m not a developer though I want to know why
> developer don’t have any control on this to happen.
>
>
>
> I’ll appreciate your guidance here.
>
>
>
> Thanks
>
> Asad
>
>
>
>
>
>


Re: Materialized View's additional PrimaryKey column

2019-07-25 Thread Jon Haddad
The issues I have with MVs aren't related to how they aren't correctly
synchronized, although I'm not happy about that either.  My issue with them
are in every cluster I've seen that uses them, the cluster has been
unstable, and I've put a lot of time into helping teams undo them.  You
will almost certainly have several hours or days of downtime as a result of
using them.

There's a good reason they're marked as experimental (and disabled by
default).  You should maintain the other tables yourself.

Jon



On Thu, Jul 25, 2019 at 12:22 AM mehmet bursali 
wrote:

> Hi Jon, thanks for your suggestion (or warning :) ).
> yes, i've read sth. about your point and i know that just because of
> using MVs, there are really several issues open in JIRA on bootstrapping,
> compaction and incremental repair stuff   but, after reading almost all
> jira tickets (with comments and history) related to using MVs,  AFAU  all
> that issues come out by either loosing syncronization between base table
> and MV by deleting columns or rows values on base table or having a huge
> system that has large and dynamic number of nodes/data/workloads. We use
> 3.11.3 version and most of the critical issues were fixed on 3.10 but  of
> course I might be miss sth so i 'll be glad if you point me some specific
> jira ticket.
> We have a certain use case that require updates on filtering (clustering)
> columns.Our motivation for using MV was avoiding updates (delete +
> create) on primaryKey columns  because we suppose that cassandra developers
> can manage this unpreferred operation better then us. I'm really confused
> now.
>
>
>
> On Wednesday, July 24, 2019, 11:30:15 PM GMT+3, Jon Haddad <
> j...@jonhaddad.com> wrote:
>
>
> I really, really advise against using MVs.  I've had to help a number of
> teams move off them.  Not sure what list of bugs you read, but if the list
> didn't include "will destabilize your cluster to the point of constant
> downtime" then the list was incomplete.
>
> Jon
>
> On Wed, Jul 24, 2019 at 6:32 AM mehmet bursali 
> wrote:
>
> + additional info: our production environment is a multiDC cluster that
> consist of 6 nodes in 2 DataCenters
>
>
>
>
> On Wednesday, July 24, 2019, 3:35:11 PM GMT+3, mehmet bursali
>  wrote:
>
>
> Hi Cassandra folks,
> I'm planning to use Materialized View (MV) on production for some specific
> cases.  I've read a lot of blogs, technical documents about the risks of
> using it  and everything seems ok for our use case.
> My question is about consistency(also durability) evaluation of MV usage
> with an additional primary key column.  İn one of our case, we select an
> UDT column of base table as addtional primary key column on MV. (UDT
> possible values are non nullable and restricted with domain.) . After
> inserting a record in base table, this additonal column (MVs primary key
> column)
> value also will be updated  for 1 or 2 time. So in our case,  for each
> update operation that will be occured on base table there are going to be
> delete and create operations inside MV.
> Does it matter  from consistency(also durability) perspective that using
> additional primary key column whether as partition column or  clustering
> column?
>
>


Re: Materialized View's additional PrimaryKey column

2019-07-24 Thread Jon Haddad
I really, really advise against using MVs.  I've had to help a number of
teams move off them.  Not sure what list of bugs you read, but if the list
didn't include "will destabilize your cluster to the point of constant
downtime" then the list was incomplete.

Jon

On Wed, Jul 24, 2019 at 6:32 AM mehmet bursali 
wrote:

> + additional info: our production environment is a multiDC cluster that
> consist of 6 nodes in 2 DataCenters
>
>
>
>
> On Wednesday, July 24, 2019, 3:35:11 PM GMT+3, mehmet bursali
>  wrote:
>
>
> Hi Cassandra folks,
> I'm planning to use Materialized View (MV) on production for some specific
> cases.  I've read a lot of blogs, technical documents about the risks of
> using it  and everything seems ok for our use case.
> My question is about consistency(also durability) evaluation of MV usage
> with an additional primary key column.  İn one of our case, we select an
> UDT column of base table as addtional primary key column on MV. (UDT
> possible values are non nullable and restricted with domain.) . After
> inserting a record in base table, this additonal column (MVs primary key
> column)
> value also will be updated  for 1 or 2 time. So in our case,  for each
> update operation that will be occured on base table there are going to be
> delete and create operations inside MV.
> Does it matter  from consistency(also durability) perspective that using
> additional primary key column whether as partition column or  clustering
> column?
>
>


Re: Compaction throughput

2019-07-19 Thread Jon Haddad
It's a limit on the total compaction throughput.

On Fri, Jul 19, 2019 at 10:39 AM Vlad  wrote:

> Hi,
>
> is  'nodetool setcompactionthroughput' sets limit for all compactions on
> the node, or is it per compaction thread?
>
> Thanks.
>


Re: Running Node Repair After Changing RF or Replication Strategy for a Keyspace

2019-06-28 Thread Jon Haddad
Yep - not to mention the increased complexity and overhead of going from
ONE to QUORUM, or the increased cost of QUORUM in RF=5 vs RF=3.

If you're in a cloud provider, I've found you're almost always better off
adding a new DC with a higher RF, assuming you're on NTS like Jeff
mentioned.
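
For concreteness, a compact sketch of the basic sequence discussed further
down in this thread (keyspace and DC names are made up; note the quoted
caveats about changing RF one increment at a time and about racks):

    -- in cqlsh, with illustrative per-DC replica counts
    ALTER KEYSPACE my_ks WITH replication =
      {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3};

and then, on every node, repair the full token range for that keyspace:

    nodetool repair -full my_ks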

On Fri, Jun 28, 2019 at 2:29 PM Jeff Jirsa  wrote:

> For just changing RF:
>
> You only need to repair the full token range - how you do that is up to
> you. Running `repair -pr -full` on each node will do that. Running `repair
> -full` will do it multiple times, so it's more work, but technically
> correct.The caveat that few people actually appreciate about changing
> replication factors (# of copies per DC) is that you often have to run
> repair after each increment - going from 3 -> 5 means 3 -> 4, repair, 4 ->
> 5 - just going 3 -> 5 will violate consistency guarantees, and is
> technically unsafe.
>
> For changing replication strategy:
>
> Changing replication strategy is nontrivial - going from Simple to NTS is
> often easy to do in a truly eventual consistency use case, but becomes much
> harder if you're:
> - using multiple DCs or
> - vnodes + racks or
> - if you must do it without violating consistency.
>
> It turns out if you're not using multiple DCs or racks, then
> simplestrategy is fine. But if you are using multiple DCs/racks, then
> changing is very very hard. So usually by the time you're asking how to do
> this, you're in a very bad position.
>
> Do you have simple strategy and multiple DCs?
> Are you using vnodes and racks?
>
> I'd be incredibly skeptical about any blog that tried to give concrete
> steps on how to do this - the steps are probably right 80% of the time, but
> horribly wrong 20% of the time, especially if there's not a paragraph or
> two about racks along the way.
>
>
>
>
>
> On Fri, Jun 28, 2019 at 7:52 AM Fd Habash  wrote:
>
>> Hi all …
>>
>>
>>
>> The datastax & apache docs are clear: run ‘nodetool repair’ after you
>> alter a keyspace to change its RF or RS.
>>
>>
>>
>> However, the details are all over the place as what type of repair and on
>> what nodes it needs to run. None of the above doc authorities are clear and
>> what you find on the internet is quite contradictory.
>>
>>
>>
>> For example, this IBM doc
>> 
>> suggest to run both the ‘alter keyspace’ and repair on EACH node affected
>> or on ‘each node you need to change the RF on’.  Others
>> ,
>> suggest to run ‘repair -pr’.
>>
>>
>>
>> On a cluster of 1 DC and three racks, this is how I understand it ….
>>
>>1. Run the ‘alter keyspace’ on a SINGLE node.
>>2. As for repairing the altered keyspac, I assume there are two
>>options …
>>   1. Run ‘repair -full [key_space]’ on all nodes in all racks
>>   2. Run ‘repair -pr -full [keyspace] on all nodes in all racks
>>
>>
>>
>> Sounds correct?
>>
>>
>>
>> 
>> Thank you
>>
>>
>>
>


Re: Recover lost node from backup or evict/re-add?

2019-06-12 Thread Jon Haddad
100% agree with Sean.  I would only use Cassandra backups in a case where
you need to restore from full cluster loss.  Example: An entire DC burns
down, tornado, flooding.

Your routine node replacement after a failure should be
replace_address_first_boot.

To ensure this goes smoothly, run regular repairs.  We (The Last Pickle)
maintain this to make it easy: http://cassandra-reaper.io/

Jon
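
A sketch of how that property is usually passed on the replacement node (the
IP is the address of the dead node being replaced, and the file may be
cassandra-env.sh or jvm.options depending on the version):

    JVM_OPTS="$JVM_OPTS -Dcassandra.replace_address_first_boot=10.0.0.12"

Once the replacement finishes bootstrapping it owns the dead node's tokens and
the property can be removed.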


On Wed, Jun 12, 2019 at 11:17 AM Durity, Sean R 
wrote:

> I’m not sure it is correct to say, “you cannot.” However, that is a more
> complicated restore and more likely to lead to inconsistent data and take
> longer to do. You are basically trying to start from a backup point and
> roll everything forward and catch up to current.
>
>
>
> Replacing/re-streaming is the well-trodden path. You are getting the net
> result of all that has happened since the node failure. And the node is not
> returning data to the clients while the bootstrap is running. If you have a
> restored/repairing node, it will accept client (and coordinator)
> connections even though it isn’t (guaranteed) consistent, yet.
>
>
>
> As I understand it – a full cluster recovery from backup still requires
> repair across the cluster to ensure consistency. In my experience, most
> apps cannot wait for a full restore/repair. Availability matters more. They
> also don’t want to pay for even more disk to hold some level of backups.
>
>
>
> There are some companies that provide finer-grained backup and recovery
> options, though.
>
>
>
> Sean Durity
>
>
>
> *From:* Alan Gano 
> *Sent:* Wednesday, June 12, 2019 1:43 PM
> *To:* user@cassandra.apache.org
> *Subject:* [EXTERNAL] RE: Recover lost node from backup or evict/re-add?
>
>
>
>
>
> Is it correct to say that a lost node cannot be restored from backup?  You
> must either replace the node or evict/re-add (i.e., rebuild from other
> nodes).
>
>
>
> Also, that snapshot, incremental, commitlog backups are relegated to
> application keyspace recovery only?
>
>
>
>
>
> How about recovery of the entire cluster? (rolling it back).  Are
> snapshots exact enough, in time, to not have a nodes that differ, in
> point-in-time, from the rest of the cluster?  Would those nodes be
> recoverable (nodetool repair?) … which brings me back to recovering a lost
> node from backup (restore last snapshot, and run nodetool repair?).
>
>
>
>
>
> Thanks,
>
>
>
> Alan Gano
>
>
>
>
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com ]
> *Sent:* Wednesday, June 12, 2019 10:14 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Recover lost node from backup or evict/re-add?
>
>
>
> A host can replace itself using the method I described
>
>
> On Jun 12, 2019, at 7:10 AM, Alan Gano  wrote:
>
> I guess I’m considering this scenario:
>
>- host and configuration have survived
>- /data is gone
>- /backups have survived
>
>
>
> I have tested recovering from this scenario with an evict/re-add, which
> worked fine.
>
>
>
> If I restore from backup, the node will be behind the cluster – e,
> does it get caught up after a restore and start it up?
>
>
>
> Alan
>
>
>
> *From:* Jeff Jirsa [mailto:jji...@gmail.com ]
> *Sent:* Wednesday, June 12, 2019 10:02 AM
> *To:* user@cassandra.apache.org
> *Subject:* Re: Recover lost node from backup or evict/re-add?
>
>
>
> To avoid violating consistency guarantees, you have to repair the replicas
> while the lost node is down
>
>
>
> Once you do that it’s typically easiest to bootstrap a replacement
> (there’s a property named “replace address first boot” you can google or
> someone can link) that tells a new joining host to take over for a failed
> machine.
>
>
>
>
> On Jun 12, 2019, at 6:54 AM, Alan Gano  wrote:
>
>
>
> If I lose a node, does it make sense to even restore from
> snapshot/incrementals/commitlogs?
>
>
>
> Or is the best way to do an evict/re-add?
>
>
>
>
>
> Thanks,
>
>
>
> Alan.
>
>
>
> NOTICE: This communication is intended only for the person or entity to
> whom it is addressed and may contain confidential, proprietary, and/or
> privileged material. Unless you are the intended addressee, any review,
> reliance, dissemination, distribution, copying or use whatsoever of this
> communication is strictly prohibited. If you received this in error, please
> reply immediately and delete the material from all computers. Email sent
> through the Internet is not secure. Do not use email to send us
> confidential information such as credit card numbers, PIN numbers,
> passwords, Social Security Numbers, Account numbers, or other important and
> confidential information.
>
> NOTICE: This communication is intended only for the person or entity to
> whom it is addressed and may contain confidential, proprietary, and/or
> privileged material. Unless you are the intended addressee, any review,
> reliance, dissemination, distribution, copying or use whatsoever of this
> communication is strictly prohibited. If you received this in error, please
> reply immediately and delete the material 

Re: Collecting Latency Metrics

2019-05-30 Thread Jon Haddad
Yep.  I would *never* use mean when it comes to performance to make any
sort of decisions.  I prefer to graph all the p99 latencies as well as the
max.

Some good reading on the topic:
https://bravenewgeek.com/everything-you-know-about-latency-is-wrong/

On Thu, May 30, 2019 at 7:35 AM Chris Lohfink  wrote:

> For what it is worth, generally I would recommend just using the mean vs
> calculating it yourself. It's a lot easier, and averages are meaningless for
> anything besides trending anyway (which is really what this is useful for,
> finding issues on the larger scale), especially with high volume clusters,
> so the loss in accuracy is kind of moot. Your average for local reads/writes
> will almost always be sub-millisecond, but you might end up having 500
> millisecond requests or worse that the mean will hide.
>
> Chris
>
> On Thu, May 30, 2019 at 6:30 AM shalom sagges 
> wrote:
>
>> Thanks for your replies guys. I really appreciate it.
>>
>> @Alain, I use Graphite for backend on top of Grafana. But the goal is to
>> move from Graphite to Prometheus eventually.
>>
>> I tried to find a direct way of getting a specific Latency metric in
>> average and as Chris pointed out, then Mean value isn't that accurate.
>> I do not wish to use the percentile metrics either, but a single latency
>> metric like the *"Local read latency" *output in nodetool tablestats.
>> Looking at the code of nodetool tablestats, it seems that C* also divides
>> *ReadTotalLatency.Count* with *ReadLatency.Count *to get the latency
>> result.
>>
>> So I guess I will have no choice but to run the calculation on my own via
>> Graphite:
>>
>> divideSeries(averageSeries(keepLastValue(nonNegativeDerivative($env.path.to.host.$host.org_apache_cassandra_metrics.Table.$ks.$cf.ReadTotalLatency.Count))),averageSeries(keepLastValue(nonNegativeDerivative($env.path.to.host.$host.org_apache_cassandra_metrics.Table.$ks.$cf.ReadLatency.Count
>>
>> Does this seem right to you?
>>
>> Thanks!
>>
>> On Thu, May 30, 2019 at 12:34 AM Paul Chandler  wrote:
>>
>>> There are various attributes under
>>> org.apache.cassandra.metrics.ClientRequest.Latency.Read these measure the
>>> latency in milliseconds
>>>
>>> Thanks
>>>
>>> Paul
>>> www.redshots.com
>>>
>>> > On 29 May 2019, at 15:31, shalom sagges 
>>> wrote:
>>> >
>>> > Hi All,
>>> >
>>> > I'm creating a dashboard that should collect read/write latency
>>> metrics on C* 3.x.
>>> > In older versions (e.g. 2.0) I used to divide the total read latency
>>> in microseconds with the read count.
>>> >
>>> > Is there a metric attribute that shows read/write latency without the
>>> need to do the math, such as in nodetool tablestats "Local read latency"
>>> output?
>>> > I saw there's a Mean attribute in
>>> org.apache.cassandra.metrics.ReadLatency but I'm not sure this is the right
>>> one.
>>> >
>>> > I'd really appreciate your help on this one.
>>> > Thanks!
>>> >
>>> >
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>>
>>>


Re: Re: How to set up a cluster with allocate_tokens_for_keyspace?

2019-05-05 Thread Jon Haddad
I mean you'd want to set up the initial tokens for the first 3 nodes
of your cluster, which are usually the seed nodes.
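
A rough sketch of what that can look like (illustrative only: 3 seed nodes with 4
tokens each, and the keyspace name is a placeholder; note the // integer division
so the one-liner behaves the same under Python 2 and 3):

# print the 4 initial tokens for seed number NODE (0, 1 or 2) out of 3 seeds
python -c "T=4; S=3; NODE=0; print(','.join(str(((2**64 // (T*S)) * (i*S + NODE)) - 2**63) for i in range(T)))"

# cassandra.yaml on each seed, before its first start:
#   num_tokens: 4
#   initial_token: <the comma-separated values printed above for that seed>
#
# every node that joins afterwards only needs:
#   num_tokens: 4
#   allocate_tokens_for_keyspace: my_keyspace   # create the keyspace with its real RF first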


On Sat, May 4, 2019 at 8:31 PM onmstester onmstester
 wrote:
>
> So do you mean setting tokens for only one node (one of the seed nodes) is
> fair enough?
> I cannot see any problem with this mechanism (only one manual token
> assignment at cluster set up), but the article was also trying to set up a
> balanced cluster, and the way it insists on doing manual token assignment
> for multiple seed nodes confused me.
>
> Sent using Zoho Mail
>
>
>
>  Forwarded message 
> From: Jon Haddad 
> To: 
> Date: Sat, 04 May 2019 22:10:39 +0430
> Subject: Re: How to set up a cluster with allocate_tokens_for_keyspace?
>  Forwarded message 
>
> That line is only relevant for when you're starting your cluster and
> you need to define your initial tokens in a non-random way. Random
> token distribution doesn't work very well when you only use 4 tokens.
>
> Once you get the cluster set up you don't need to specify tokens
> anymore, you can just use allocate_tokens_for_keyspace.
>
> On Sat, May 4, 2019 at 2:14 AM onmstester onmstester
>  wrote:
> >
> > I just read this article by tlp:
> > https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
> >
> > Noticed that:
> > >>We will need to set the tokens for the seed nodes in each rack manually. 
> > >>This is to prevent each node from randomly calculating its own token 
> > >>ranges
> >
> > But until now, i was using this recommendation to setup a new cluster:
> > >>
> >
> > You'll want to set them explicitly using: python -c 'print( [str(((2**64 / 
> > 4) * i) - 2**63) for i in range(4)])'
> >
> >
> > After you fire up the first seed, create a keyspace using RF=3 (or whatever 
> > you're planning on using) and set allocate_tokens_for_keyspace to that 
> > keyspace in your config, and join the rest of the nodes. That gives even
> > distribution.
> >
> > I've defined plenty of racks in my cluster (and only 3 seed nodes), should 
> > i have a seed node per rack and use initial_token for all of the seed nodes 
> > or just one seed node with inital_token would be ok?
> >
> > Best Regards
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Priority in IN () cqlsh command

2019-05-05 Thread Jon Haddad
Do separate queries for each partition you want.  There's no benefit
in using the IN() clause here, and performance is significantly worse
with multi-partition IN(), especially if the partitions are small.
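
For example, with the DataStax Python driver you can still fire the per-partition
queries concurrently and reassemble them in whatever order you want (a rough
sketch; the table is the one from the question below, the host and values are
placeholders):

from cassandra.cluster import Cluster

session = Cluster(['127.0.0.1']).connect('my_keyspace')
query = ("SELECT * FROM data WHERE nid = %s AND mm = %s AND tid = %s "
         "AND ts >= %s AND ts <= %s")
months = [201905, 201904]   # the order you want the results back in
futures = [session.execute_async(query, ('value', mm, 'value2', 155639466, 155699946))
           for mm in months]
results = [f.result() for f in futures]   # one ResultSet per partition, in your order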

On Sun, May 5, 2019 at 4:52 AM Soheil Pourbafrani  wrote:
>
> Hi,
>
> I want to run cqlsh query on cassandra table using IN
>
> SELECT * from data WHERE nid = 'value' AND mm IN (201905,201904) AND tid 
> = 'value2' AND ts >= 155639466 AND ts <= 155699946 ;
>
> The nid and mm columns are the partition key and ts is the clustering key.
> The problem is that Cassandra doesn't care about the order of the IN list and
> always returns the 201904 partition data first, and only after that the 201905
> partition data, but I wanted the 201905 partition data to come first.
>
> Is there any solution for this?

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: How to set up a cluster with allocate_tokens_for_keyspace?

2019-05-04 Thread Jon Haddad
That line is only relevant for when you're starting your cluster and
you need to define your initial tokens in a non-random way.  Random
token distribution doesn't work very well when you only use 4 tokens.

Once you get the cluster set up you don't need to specify tokens
anymore, you can just use allocate_tokens_for_keyspace.

On Sat, May 4, 2019 at 2:14 AM onmstester onmstester
 wrote:
>
> I just read this article by tlp:
> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>
> Noticed that:
> >>We will need to set the tokens for the seed nodes in each rack manually. 
> >>This is to prevent each node from randomly calculating its own token ranges
>
>  But until now, i was using this recommendation to setup a new cluster:
> >>
>
> You'll want to set them explicitly using: python -c 'print( [str(((2**64 / 4) 
> * i) - 2**63) for i in range(4)])'
>
>
> After you fire up the first seed, create a keyspace using RF=3 (or whatever 
> you're planning on using) and set allocate_tokens_for_keyspace to that 
> keyspace in your config, and join the rest of the nodes. That gives even
> distribution.
>
> I've defined plenty of racks in my cluster (and only 3 seed nodes), should i 
> have a seed node per rack and use initial_token for all of the seed nodes or 
> just one seed node with inital_token would be ok?
>
> Best Regards
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Increasing the size limits implications

2019-04-30 Thread Jon Haddad
Just curious - why are you using such large batches?  Most of the time
when someone asks this question, it's because they're using batches as
they would in an RDBMS, because larger transactions improve
performance.  That doesn't apply with Cassandra.

Batches are OK at keeping multiple tables in sync, that's about it.
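
If the batches are only there for throughput, individual prepared statements
executed concurrently are usually a better fit.  A rough sketch with the DataStax
Python driver (the table, column names and values are made up):

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

session = Cluster(['127.0.0.1']).connect('my_keyspace')
insert = session.prepare("INSERT INTO events (pk, seq, payload) VALUES (?, ?, ?)")
args = [('partition-1', i, 'payload-%d' % i) for i in range(1000)]
# keeps up to 50 requests in flight and returns (success, result_or_exc) pairs
results = execute_concurrent_with_args(session, insert, args, concurrency=50)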

Jon

On Mon, Apr 29, 2019 at 10:18 AM Bobbie Haynes  wrote:
>
> Hi,
>   I'm inserting into Cassandra in batches (each containing a single PK),
> but my batches are failing and throwing exceptions.
> I want to know: if we increase batch_size_warn_threshold_in_kb to 200KB and
> batch_size_fail_threshold_in_kb to 300KB, what potential issues could I
> be facing?
>
> Thanks,
> Bobbie

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: [EXTERNAL] multiple Cassandra instances per server, possible?

2019-04-18 Thread Jon Haddad
Agreed with Jeff here.  The whole "community recommends no more than
1TB" has been around, and inaccurate, for a long time.

The biggest issue with dense nodes is how long it takes to replace
them.  4.0 should help with that under certain circumstances.


On Thu, Apr 18, 2019 at 6:57 AM Jeff Jirsa  wrote:
>
> Agreed that you can go larger than 1T on ssd
>
> You can do this safely with both instances in the same cluster if you 
> guarantee two replicas aren’t on the same machine. Cassandra provides a 
> primitive to do this - rack awareness through the network topology snitch.
>
> The limitation (until 4.0) is that you’ll need two IPs per machine, as both
> instances have to run on the same port.
>
>
> --
> Jeff Jirsa
>
>
> On Apr 18, 2019, at 6:45 AM, Durity, Sean R  
> wrote:
>
> What is the data problem that you are trying to solve with Cassandra? Is it 
> high availability? Low latency queries? Large data volumes? High concurrent 
> users? I would design the solution to fit the problem(s) you are solving.
>
>
>
> For example, if high availability is the goal, I would be very cautious about 
> 2 nodes/machine. If you need the full amount of the disk – you *can* have 
> larger nodes than 1 TB. I agree that administration tasks (like 
> adding/removing nodes, etc.) are more painful with large nodes – but not 
> impossible. For large amounts of data, I like nodes that have about 2.5 – 3 
> TB of usable SSD disk.
>
>
>
> It is possible that your nodes might be under-utilized, especially at first. 
> But if the hardware is already available, you have to use what you have.
>
>
>
> We have done multiple nodes on single physical hardware, but they were two 
> separate clusters (for the same application). In that case, we had  a 
> different install location and different ports for one of the clusters.
>
>
>
> Sean Durity
>
>
>
> From: William R 
> Sent: Thursday, April 18, 2019 9:14 AM
> To: user@cassandra.apache.org
> Subject: [EXTERNAL] multiple Cassandra instances per server, possible?
>
>
>
> Hi all,
>
>
>
> In our small company we have 10 nodes with (2 x 3 TB HD) 6 TB each, 128 GB RAM
> and 64 cores, and we are thinking of using them as Cassandra nodes. From what I
> am reading around, the community recommends that every node should not keep
> more than 1 TB of data, so in this case I am wondering if it is possible to
> install 2 instances per node using Docker, so each Docker instance can write
> to its own physical disk and utilise the rest of the hardware (CPU
> & RAM) more efficiently.
>
>
>
> I understand that with this setup there is the danger of creating a single point
> of failure for 2 Cassandra nodes, but apart from that, do you think it is a
> feasible setup to start the cluster with?
>
>
>
> Apart from the Docker solution, do you recommend any other way to split the
> physical node into 2 instances? (VMware? Or maybe even 2 separate installations
> of Cassandra?)
>
>
>
> Eventually we are aiming at a cluster consisting of 2 DCs with 10 nodes each
> (5 bare-metal nodes, each running 2 Cassandra instances).
>
>
>
> Probably later, when we start introducing more nodes to the cluster, we
> can decommission the "double-instanced" ones and aim for a more homogeneous
> solution.
>
>
>
> Thank you,
>
>
>
> Wil
>
>
> 
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

2019-04-17 Thread Jon Haddad
Let me be more specific - run the async java profiler and generate a
flame graph to determine where CPU time is spent.

On Wed, Apr 17, 2019 at 11:36 AM Jon Haddad  wrote:
>
> Run the async java profiler on the node to determine what it's doing:
> https://github.com/jvm-profiling-tools/async-profiler
>
> On Wed, Apr 17, 2019 at 11:31 AM Carl Mueller
>  wrote:
> >
> > No, we just did the package upgrade 2.1.9 --> 2.2.13
> >
> > It definitely feels like some indexes are being recalculated or the entire 
> > sstables are being scanned due to suspected corruption.
> >
> >
> > On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa  wrote:
> >>
> >> There was a time when changing some of the parameters (especially bloom 
> >> filter FP ratio) would cause the bloom filters to be rebuilt on startup if 
> >> the sstables didnt match what was in the schema, leading to a delay like 
> >> that and similar logs. Any chance you changed the schema on that table 
> >> since the last time you restarted it?
> >>
> >>
> >>
> >> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller 
> >>  wrote:
> >>>
> >>> Oh, the table in question is SizeTiered, had about 10 sstables total, it 
> >>> was JBOD across two data directories.
> >>>
> >>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller 
> >>>  wrote:
> >>>>
> >>>> We are doing a ton of upgrades to get out of 2.1.x. We've done probably 
> >>>> 20-30 clusters so far and have not encountered anything like this yet.
> >>>>
> >>>> After upgrade of a node, the restart takes a long time. like 10 minutes 
> >>>> long. ALmost all of our other nodes took less than 2 minutes to upgrade 
> >>>> (aside from sstableupgrades).
> >>>>
> >>>> The startup stalls on a particular table, it is the largest table at 
> >>>> about 300GB, but we have upgraded other clusters with about that much 
> >>>> data without this 8-10 minute delay. We have the ability to roll back 
> >>>> the node, and the restart as a 2.1.x node is normal with no delays.
> >>>>
> >>>> Alas this is a prod cluster so we are going to try to sstable load the 
> >>>> data on a lower environment and try to replicate the delay. If we can, 
> >>>> we will turn on debug logging.
> >>>>
> >>>> This occurred on the first node we tried to upgrade. It is possible it 
> >>>> is limited to only this node, but we are gunshy to play around with 
> >>>> upgrades in prod.
> >>>>
> >>>> We have an automated upgrading program that flushes, snapshots, shuts 
> >>>> down gossip, drains before upgrade, suppressed autostart on upgrade, and 
> >>>> has worked about as flawlessly as one could hope for so far for 2.1->2.2 
> >>>> and 2.2-> 3.11 upgrades.
> >>>>
> >>>> INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - 
> >>>> Initializing .access_token
> >>>> INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - 
> >>>> Initializing .refresh_token
> >>>> INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - 
> >>>> Initializing .userid
> >>>> INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - 
> >>>> Initializing .access_token_by_auth
> >>>>
> >>>> You can see the 6:30 delay in the startup log above. All the other 
> >>>> keyspace/tables initialize in under a second.
> >>>>
> >>>>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: 2.1.9 --> 2.2.13 upgrade node startup after upgrade very slow

2019-04-17 Thread Jon Haddad
Run the async java profiler on the node to determine what it's doing:
https://github.com/jvm-profiling-tools/async-profiler

On Wed, Apr 17, 2019 at 11:31 AM Carl Mueller
 wrote:
>
> No, we just did the package upgrade 2.1.9 --> 2.2.13
>
> It definitely feels like some indexes are being recalculated or the entire 
> sstables are being scanned due to suspected corruption.
>
>
> On Wed, Apr 17, 2019 at 12:32 PM Jeff Jirsa  wrote:
>>
>> There was a time when changing some of the parameters (especially bloom 
>> filter FP ratio) would cause the bloom filters to be rebuilt on startup if 
>> the sstables didnt match what was in the schema, leading to a delay like 
>> that and similar logs. Any chance you changed the schema on that table since 
>> the last time you restarted it?
>>
>>
>>
>> On Wed, Apr 17, 2019 at 10:30 AM Carl Mueller 
>>  wrote:
>>>
>>> Oh, the table in question is SizeTiered, had about 10 sstables total, it 
>>> was JBOD across two data directories.
>>>
>>> On Wed, Apr 17, 2019 at 12:26 PM Carl Mueller 
>>>  wrote:

 We are doing a ton of upgrades to get out of 2.1.x. We've done probably 
 20-30 clusters so far and have not encountered anything like this yet.

 After upgrade of a node, the restart takes a long time. like 10 minutes 
 long. ALmost all of our other nodes took less than 2 minutes to upgrade 
 (aside from sstableupgrades).

 The startup stalls on a particular table, it is the largest table at about 
 300GB, but we have upgraded other clusters with about that much data 
 without this 8-10 minute delay. We have the ability to roll back the node, 
 and the restart as a 2.1.x node is normal with no delays.

 Alas this is a prod cluster so we are going to try to sstable load the 
 data on a lower environment and try to replicate the delay. If we can, we 
 will turn on debug logging.

 This occurred on the first node we tried to upgrade. It is possible it is 
 limited to only this node, but we are gunshy to play around with upgrades 
 in prod.

 We have an automated upgrading program that flushes, snapshots, shuts down 
 gossip, drains before upgrade, suppressed autostart on upgrade, and has 
 worked about as flawlessly as one could hope for so far for 2.1->2.2 and 
 2.2-> 3.11 upgrades.

 INFO  [main] 2019-04-16 17:22:17,004 ColumnFamilyStore.java:389 - 
 Initializing .access_token
 INFO  [main] 2019-04-16 17:22:17,096 ColumnFamilyStore.java:389 - 
 Initializing .refresh_token
 INFO  [main] 2019-04-16 17:28:52,929 ColumnFamilyStore.java:389 - 
 Initializing .userid
 INFO  [main] 2019-04-16 17:28:52,930 ColumnFamilyStore.java:389 - 
 Initializing .access_token_by_auth

 You can see the 6:30 delay in the startup log above. All the other 
 keyspace/tables initialize in under a second.



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Questions about C* performance related to tombstone

2019-04-09 Thread Jon Haddad
Normal deletes are fine.

Sadly there's a lot of hand wringing about tombstones in the generic
sense which leads people to try to work around *every* case where
they're used.  This is unnecessary.  A tombstone over a single row
isn't a problem, especially if you're only fetching that one row back.
Tombstones can be quite terrible under a few conditions:

1. When a range tombstone shadows hundreds / thousands / millions of
rows.  This wasn't even detectable prior to Cassandra 3 unless you
were either looking for it specifically or were doing CPU profiling:
http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html
2. When rows were frequently created then deleted, and scanned over.
This is the queue pattern that we detest so much.
3. When they're created as a side effect of overwriting
collections.  This is typically an accident.

The 'active' flag is good if you want to be able to go back and look
at old deleted assignments.  If you don't care about that, use a
normal delete.
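
If you want to sanity-check how much a given read actually pays for tombstones,
cqlsh tracing shows it directly (a quick illustration against the table from the
question below):

TRACING ON;
SELECT * FROM myTable WHERE course_id = 'C' AND assignment_id = 'A2';
-- the trace includes lines like "Read N live rows and M tombstone cells"
TRACING OFF;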

Jon

On Tue, Apr 9, 2019 at 7:00 AM Li, George  wrote:
>
> Hi,
>
> I have a table defined like this:
>
> CREATE TABLE myTable (
> course_id text,
> assignment_id text,
> assignment_item_id text,
> data text,
> active boolean,
> PRIMARY KEY (course_id, assignment_id, assignment_item_id)
> );
> i.e. course_id as the partition key and assignment_id, assignment_item_id as 
> clustering keys.
>
> After data is populated, some delete queries by course_id and assignment_id
> occur, e.g. "DELETE FROM myTable WHERE course_id = 'C' AND assignment_id =
> 'A1';". This would create tombstones, so the query "SELECT * FROM myTable WHERE
> course_id = 'C';" would be affected, right? Would the query "SELECT * FROM
> myTable WHERE course_id = 'C' AND assignment_id = 'A2';" be affected too?
>
> For query "SELECT * FROM myTable WHERE course_id = 'C';", to workaround the 
> tombstone problem, we are thinking about not doing hard deletes, instead 
> doing soft deletes. So instead of doing "DELETE FROM myTable WHERE course_id 
> = 'C' AND assignment_id = 'A1';", we do "UPDATE myTable SET active = false 
> WHERE course_id = 'C' AND assignment_id = 'A1';". Then in the application, we 
> do query "SELECT * FROM myTable WHERE course_id = 'C';" and filter out 
> records that have "active" equal to "false". I am not really sure this would 
> improve performance because C* still has to scan through all records with the 
> partition key "C". It is just instead of scanning through X records + Y 
> tombstone records with hard deletes that generate tombstones, it now scans 
> through X + Y records with soft deletes and no tombstones. Am I right?
>
> Thanks.
>
> George

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: How to monitor datastax driver compression performance?

2019-04-09 Thread Jon Haddad
tlp-stress has support for customizing payloads, but it's not
documented very well.  For a given data model (say the KeyValue one),
you can override what tlp-stress will send over.  By default it's
pretty small, a handful of bytes.

You pass --field.keyvalue.value (the table name + the field name)
followed by the custom field generator you'd like to use.  For example,
--field.keyvalue.value='random(1,11000)' will generate up to roughly 10K random
characters.  You can also generate text from real words by using the
book(100,200) function (100-200 random words out of books) if you want
something that will compress better.
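
Putting that together, an invocation might look like this (a sketch from memory;
double check `tlp-stress run --help` for the exact flags in your version):

tlp-stress run KeyValue -n 1000000 -p 100000 --field.keyvalue.value='book(100,200)'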

You can see a (poorly formatted) list of all the customizations you
can do by running `tlp-stress fields`

This is one the areas I haven't spent enough time on to share with the
world in a carefree manner, but it works.  If you're willing to
overlook the poor docs in the area I think it might meet your needs.

Regarding compression at the query level vs not, I think you should
look at the overhead first.  I'm betting you'll find it's
insignificant.  That said, you can always create two cluster objects
with two radically different settings if you find you need it.
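
With the Python driver, for instance, that's just two Cluster objects side by side
(a sketch; the contact point is a placeholder, and 'lz4' needs the lz4 package
installed on the client):

from cassandra.cluster import Cluster

compressed = Cluster(['10.0.0.1'], compression='lz4')    # for the wide-row / large-payload tables
uncompressed = Cluster(['10.0.0.1'], compression=False)  # for the small-payload tables
session_a = compressed.connect('my_keyspace')
session_b = uncompressed.connect('my_keyspace')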

On Tue, Apr 9, 2019 at 6:32 AM Gabriel Giussi  wrote:
>
> Does tlp-stress allow us to define the size of rows? Because I will only see the
> benefit of compression in terms of request rates if the compression ratio is
> significant, i.e. it requires fewer network round trips.
> Could this be done by generating bigger partitions with the -n and -p parameters,
> i.e. by decreasing -p?
>
> Also, don't you think the driver should allow configuring compression per
> query? Because one table with wide rows could benefit from compression while
> another one with a smaller payload might not.
>
> Thanks for your help Jon.
>
>
> El lun., 8 abr. 2019 a las 19:13, Jon Haddad () escribió:
>>
>> If it were me, I'd look at raw request rates (in terms of requests /
>> second as well as request latency), network throughput and then some
>> flame graphs of both the server and your application:
>> https://github.com/jvm-profiling-tools/async-profiler.
>>
>> I've created an issue in tlp-stress to add compression options for the
>> driver: https://github.com/thelastpickle/tlp-stress/issues/67.  If
>> you're interested in contributing the feature I think tlp-stress will
>> more or less solve the remainder of the problem for you (the load
>> part, not the os numbers).
>>
>> Jon
>>
>>
>>
>>
>> On Mon, Apr 8, 2019 at 7:26 AM Gabriel Giussi  
>> wrote:
>> >
>> > Hi, I'm trying to test if adding driver compression will bring me any 
>> > benefit.
>> > I understand that the trade-off is less bandwidth but increased CPU usage 
>> > in both cassandra nodes (compression) and client nodes (decompression) but 
>> > I want to know what are the key metrics and how to monitor them to probe 
>> > compression is giving good results?
>> > I guess I should look at latency percentiles reported by 
>> > com.datastax.driver.core.Metrics and CPU usage, but what about bandwith 
>> > usage and compression ratio?
>> > Should I use tcpdump to capture packets length coming from cassandra 
>> > nodes? Something like tcpdump -n "src port 9042 and tcp[13] & 8 != 0" | 
>> > sed -n "s/^.*length \(.*\).*$/\1/p" would be enough?
>> >
>> > Thanks
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> For additional commands, e-mail: user-h...@cassandra.apache.org
>>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: How to monitor datastax driver compression performance?

2019-04-08 Thread Jon Haddad
If it were me, I'd look at raw request rates (in terms of requests /
second as well as request latency), network throughput and then some
flame graphs of both the server and your application:
https://github.com/jvm-profiling-tools/async-profiler.

I've created an issue in tlp-stress to add compression options for the
driver: https://github.com/thelastpickle/tlp-stress/issues/67.  If
you're interested in contributing the feature I think tlp-stress will
more or less solve the remainder of the problem for you (the load
part, not the os numbers).

Jon




On Mon, Apr 8, 2019 at 7:26 AM Gabriel Giussi  wrote:
>
> Hi, I'm trying to test if adding driver compression will bring me any benefit.
> I understand that the trade-off is less bandwidth but increased CPU usage in 
> both cassandra nodes (compression) and client nodes (decompression) but I 
> want to know what are the key metrics and how to monitor them to probe 
> compression is giving good results?
> I guess I should look at latency percentiles reported by 
> com.datastax.driver.core.Metrics and CPU usage, but what about bandwith usage 
> and compression ratio?
> Should I use tcpdump to capture packets length coming from cassandra nodes? 
> Something like tcpdump -n "src port 9042 and tcp[13] & 8 != 0" | sed -n 
> "s/^.*length \(.*\).*$/\1/p" would be enough?
>
> Thanks

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Assassinate fails

2019-04-04 Thread Jon Haddad
No, it can't.  As Alain (and I) have said, since the system keyspace
is local strategy, it's not replicated, and thus can't be repaired.

On Thu, Apr 4, 2019 at 9:54 AM Kenneth Brotman
 wrote:
>
> Right, could be similar issue, same type of fix though.
>
> -----Original Message-----
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:52 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> System != system_auth.
>
> On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
>  wrote:
> >
> > From Mastering Cassandra:
> >
> >
> > Forcing read repairs at consistency – ALL
> >
> > The type of repair isn't really part of the Apache Cassandra repair 
> > paradigm at all. When it was discovered that a read repair will trigger 
> > 100% of the time when a query is run at ALL consistency, this method of 
> > repair started to gain popularity in the community. In some cases, this 
> > method of forcing data consistency provided better results than normal, 
> > scheduled repairs.
> >
> > Let's assume, for a second, that an application team is having a hard time 
> > logging into a node in a new data center. You try to cqlsh out to these 
> > nodes, and notice that you are also experiencing intermittent failures, 
> > leading you to suspect that the system_auth tables might be missing a 
> > replica or two. On one node you do manage to connect successfully using 
> > cqlsh. One quick way to fix consistency on the system_auth tables is to set 
> > consistency to ALL, and run an unbound SELECT on every table, tickling each 
> > record:
> >
> > use system_auth ;
> > consistency ALL;
> > consistency level set to ALL.
> >
> > SELECT COUNT(*) FROM resource_role_permissons_index ;
> > SELECT COUNT(*) FROM role_permissions ;
> > SELECT COUNT(*) FROM role_members ;
> > SELECT COUNT(*) FROM roles;
> >
> > This problem is often seen when logging in with the default cassandra user. 
> > Within cqlsh, there is code that forces the default cassandra user to 
> > connect by querying system_auth at QUORUM consistency. This can be 
> > problematic in larger clusters, and is another reason why you should never 
> > use the default cassandra user.
> >
> >
> >
> > -----Original Message-----
> > From: Jon Haddad [mailto:j...@jonhaddad.com]
> > Sent: Thursday, April 04, 2019 9:21 AM
> > To: user@cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> > Ken,
> >
> > Alain is right about the system tables.  What you're describing only
> > works on non-local tables.  Changing the CL doesn't help with
> > keyspaces that use LocalStrategy.  Here's the definition of the system
> > keyspace:
> >
> > CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> > AND durable_writes = true;
> >
> > Jon
> >
> > On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
> >  wrote:
> > >
> > > The trick below I got from the book Mastering Cassandra.  You have to set 
> > > the consistency to ALL for it to work. I thought you guys knew that one.
> > >
> > >
> > >
> > > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > > Sent: Thursday, April 04, 2019 8:46 AM
> > > To: user cassandra.apache.org
> > > Subject: Re: Assassinate fails
> > >
> > >
> > >
> > > Hi Alex,
> > >
> > >
> > >
> > > About previous advices:
> > >
> > >
> > >
> > > You might have inconsistent data in your system tables.  Try setting the 
> > > consistency level to ALL, then do read query of system tables to force 
> > > repair.
> > >
> > >
> > >
> > > System tables use the 'LocalStrategy', thus I don't think any repair 
> > > would happen for the system.* tables. Regardless the consistency you use. 
> > > It should not harm, but I really think it won't help.
> > >
> > >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
> >
> > -
> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: user-h...@cassandra.apache.org
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Assassinate fails

2019-04-04 Thread Jon Haddad
System != system_auth.

On Thu, Apr 4, 2019 at 9:43 AM Kenneth Brotman
 wrote:
>
> From Mastering Cassandra:
>
>
> Forcing read repairs at consistency – ALL
>
> The type of repair isn't really part of the Apache Cassandra repair paradigm 
> at all. When it was discovered that a read repair will trigger 100% of the 
> time when a query is run at ALL consistency, this method of repair started to 
> gain popularity in the community. In some cases, this method of forcing data 
> consistency provided better results than normal, scheduled repairs.
>
> Let's assume, for a second, that an application team is having a hard time 
> logging into a node in a new data center. You try to cqlsh out to these 
> nodes, and notice that you are also experiencing intermittent failures, 
> leading you to suspect that the system_auth tables might be missing a replica 
> or two. On one node you do manage to connect successfully using cqlsh. One 
> quick way to fix consistency on the system_auth tables is to set consistency 
> to ALL, and run an unbound SELECT on every table, tickling each record:
>
> use system_auth ;
> consistency ALL;
> consistency level set to ALL.
>
> SELECT COUNT(*) FROM resource_role_permissons_index ;
> SELECT COUNT(*) FROM role_permissions ;
> SELECT COUNT(*) FROM role_members ;
> SELECT COUNT(*) FROM roles;
>
> This problem is often seen when logging in with the default cassandra user. 
> Within cqlsh, there is code that forces the default cassandra user to connect 
> by querying system_auth at QUORUM consistency. This can be problematic in 
> larger clusters, and is another reason why you should never use the default 
> cassandra user.
>
>
>
> -----Original Message-----
> From: Jon Haddad [mailto:j...@jonhaddad.com]
> Sent: Thursday, April 04, 2019 9:21 AM
> To: user@cassandra.apache.org
> Subject: Re: Assassinate fails
>
> Ken,
>
> Alain is right about the system tables.  What you're describing only
> works on non-local tables.  Changing the CL doesn't help with
> keyspaces that use LocalStrategy.  Here's the definition of the system
> keyspace:
>
> CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
> AND durable_writes = true;
>
> Jon
>
> On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
>  wrote:
> >
> > The trick below I got from the book Mastering Cassandra.  You have to set 
> > the consistency to ALL for it to work. I thought you guys knew that one.
> >
> >
> >
> > From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> > Sent: Thursday, April 04, 2019 8:46 AM
> > To: user cassandra.apache.org
> > Subject: Re: Assassinate fails
> >
> >
> >
> > Hi Alex,
> >
> >
> >
> > About previous advices:
> >
> >
> >
> > You might have inconsistent data in your system tables.  Try setting the 
> > consistency level to ALL, then do read query of system tables to force 
> > repair.
> >
> >
> >
> > System tables use the 'LocalStrategy', thus I don't think any repair would 
> > happen for the system.* tables. Regardless the consistency you use. It 
> > should not harm, but I really think it won't help.
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Assassinate fails

2019-04-04 Thread Jon Haddad
Ken,

Alain is right about the system tables.  What you're describing only
works on non-local tables.  Changing the CL doesn't help with
keyspaces that use LocalStrategy.  Here's the definition of the system
keyspace:

CREATE KEYSPACE system WITH replication = {'class': 'LocalStrategy'}
AND durable_writes = true;
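
For contrast, system_auth (which the CL=ALL trick does apply to) is replicated.
On a stock install it looks like the following and is usually altered to a higher
RF; this is illustrative only, the DC name is a placeholder:

CREATE KEYSPACE system_auth WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'} AND durable_writes = true;

ALTER KEYSPACE system_auth WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3};
-- then repair system_auth on each node so the extra replicas get the data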

Jon

On Thu, Apr 4, 2019 at 9:03 AM Kenneth Brotman
 wrote:
>
> The trick below I got from the book Mastering Cassandra.  You have to set the 
> consistency to ALL for it to work. I thought you guys knew that one.
>
>
>
> From: Alain RODRIGUEZ [mailto:arodr...@gmail.com]
> Sent: Thursday, April 04, 2019 8:46 AM
> To: user cassandra.apache.org
> Subject: Re: Assassinate fails
>
>
>
> Hi Alex,
>
>
>
> About previous advices:
>
>
>
> You might have inconsistent data in your system tables.  Try setting the 
> consistency level to ALL, then do read query of system tables to force repair.
>
>
>
> System tables use the 'LocalStrategy', thus I don't think any repair would 
> happen for the system.* tables. Regardless the consistency you use. It should 
> not harm, but I really think it won't help.
>
>

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Cassandra Possible read/write race condition in LOCAL_ONE?

2019-03-28 Thread Jon Haddad
I'm reading the OP as doing this from a single server; if that's the
case, QUORUM / LOCAL_QUORUM will work.
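
Something like this sketch gives read-your-writes via quorum reads and writes
(DataStax Python driver; the host, table and column names are invented):

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT

profile = ExecutionProfile(consistency_level=ConsistencyLevel.LOCAL_QUORUM)
cluster = Cluster(['10.0.0.1'], execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect('my_keyspace')
session.execute("INSERT INTO table_a (id, val) VALUES (%s, %s)", (42, 'x'))
row = session.execute("SELECT val FROM table_a WHERE id = %s", (42,)).one()  # sees the write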

On Thu, Mar 28, 2019 at 3:29 PM Jeff Jirsa  wrote:
>
> Yes it can race; if you don't want to race, you'd want to use SERIAL or 
> LOCAL_SERIAL.
>
> On Thu, Mar 28, 2019 at 3:04 PM Richard Xin  
> wrote:
>>
>> Hi,
>> Our Cassandra consistency level is currently set to LOCAL_ONE. We have a
>> script doing the following:
>> 1) insert one record into table_A
>> 2) select the last inserted record from table_A and do something ...
>>
>> Steps #1 & #2 run sequentially without pause, and I assume 1 & 2 are
>> supposed to run in the same DC.
>>
>> We are facing sporadic issues where step #2 doesn't get the data inserted by #1.
>> Is it possible to have a race condition with LOCAL_ONE such that #2 might not
>> see the data inserted in step #1?
>>
>> Thanks in advance!
>> Richard

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Garbage Collector

2019-03-19 Thread Jon Haddad
G1 is optimized for high throughput with higher pause times.  It's great if
you have mixed / unpredictable workloads, and as Elliott mentioned is
mostly set & forget.
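
A minimal jvm.options sketch for switching to G1 (the heap size is a placeholder;
size it for your hardware, and leave -Xmn unset so G1 can size the young gen
itself):

-Xms16G
-Xmx16G
-XX:+UseG1GC
-XX:MaxGCPauseMillis=300
-XX:G1RSetUpdatingPauseTimePercent=5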

ZGC requires Java 11, which is only supported on trunk.  I plan on messing
with it soon, but I haven't had time yet.  We'll share the results on our
blog (TLP) when we get to it.

Jon

On Tue, Mar 19, 2019 at 10:12 AM Elliott Sims  wrote:

> I use G1, and I think it's actually the default now for newer Cassandra
> versions.  For G1, I've done very little custom config/tuning.  I increased
> heap to 16GB (out of 64GB physical), but most of the rest is at or near
> default.  For the most part, it's been "feed it more RAM, and it works"
> compared to CMS's "lower overhead, works great until it doesn't" and dozens
> of knobs.
>
> I haven't tried ZGC yet, but anecdotally I've heard that it doesn't really
> match or beat G1 quite yet.
>
> On Tue, Mar 19, 2019 at 9:44 AM Ahmed Eljami 
> wrote:
>
>> Hi Folks,
>>
>> Does someone use G1 GC or ZGC on production?
>>
>> Can you share your feedback, the configuration used if it's possible ?
>>
>> Thanks.
>>
>>


Re: Fw: read request is slow

2019-03-18 Thread Jon Haddad
SJK-plus is bundled with DSE, not OSS Cassandra.

Grab it here if you need it:
https://github.com/aragozin/jvm-tools

On Mon, Mar 18, 2019 at 6:38 AM Dieudonné Madishon NGAYA 
wrote:

> Never mind : nodetool sjk help
>
> On Mon, Mar 18, 2019 at 9:36 AM Dieudonné Madishon NGAYA <
> dmng...@gmail.com> wrote:
>
>> Hi, please send us result of: nodetool help sjk
>>
>> On Mon, Mar 18, 2019 at 9:28 AM ishib...@gmail.com 
>> wrote:
>>
>>> Hi!
>>>
>>> This is an interesting note for me.
>>>
>>> We have C* 3.11.1, but ‘sjk’ is still an unexpected parameter for the nodetool
>>> utility.
>>>
>>> Am I missing something, or is ‘sjk’ available as a separate rpm package?
>>>
>>>
>>>
>>> Best regards, Ilya
>>>
>>>
>>> 
>>> Тема: Re: read request is slow
>>> От: Dieudonné Madishon NGAYA
>>> Кому: user@cassandra.apache.org
>>> Копия:
>>>
>>>
>>>
>>> For your information, since Cassandra 3.0 it includes ttop and other
>>> options inside sjk:
>>> nodetool sjk
>>>
>>> https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsSjk.html
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From:* Jon Haddad [mailto:j...@jonhaddad.com]
>>> *Sent:* Saturday, March 16, 2019 5:25 PM
>>> *To:* user@cassandra.apache.org
>>> *Subject:* Re: read request is slow
>>>
>>>
>>>
>>> I'm guessing you're getting 100MB from the comments in the config, which
>>> suggest 100MB per core.  This advice is pretty outdated and should be
>>> updated.
>>>
>>>
>>>
>>> I'd use 8GB total heap and 4GB new gen as a starting point.  I really
>>> suggest reading up on how GC works, I linked to a post in an earlier email.
>>>
>>>
>>>
>>> These are the flags you'd need to set in your jvm.options, or
>>> jvm-server.options depending on the version you're using:
>>>
>>>
>>>
>>> -Xmx8G
>>>
>>> -Xms8G
>>>
>>> -Xmn4G
>>>
>>>
>>>
>>> 1 core is probably going to be a problem, Cassandra creates a lot of
>>> threads and relies on doing work concurrently.  I wouldn't use less than 8
>>> cores in a production environment.
>>>
>>>
>>>
>>> - To
>>> unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org For
>>> additional commands, e-mail: user-h...@cassandra.apache.org
>>
>> --
>>
>> Best regards
>> _
>>
>>
>> *Dieudonne Madishon NGAYA*
>> Datastax, Cassandra Architect
>> *P: *7048580065
>> *w: *www.dmnbigdata.com
>> *E: *dmng...@dmnbigdata.com
>> *Private E: *dmng...@gmail.com
>> *A: *Charlotte,NC,28273, USA
>>
> --
>
> Best regards
> _
>
>
> *Dieudonne Madishon NGAYA*
> Datastax, Cassandra Architect
> *P: *7048580065
> *w: *www.dmnbigdata.com
> *E: *dmng...@dmnbigdata.com
> *Private E: *dmng...@gmail.com
> *A: *Charlotte,NC,28273, USA
>


Re: read request is slow

2019-03-16 Thread Jon Haddad
I'm guessing you're getting 100MB from the comments in the config, which
suggest 100MB per core.  This advice is pretty outdated and should be
updated.

I'd use 8GB total heap and 4GB new gen as a starting point.  I really
suggest reading up on how GC works, I linked to a post in an earlier email.

These are the flags you'd need to set in your jvm.options, or
jvm-server.options depending on the version you're using:

-Xmx8G
-Xms8G
-Xmn4G

1 core is probably going to be a problem, Cassandra creates a lot of
threads and relies on doing work concurrently.  I wouldn't use less than 8
cores in a production environment.

On Sun, Mar 17, 2019 at 3:12 AM Dieudonné Madishon NGAYA 
wrote:

> Starting point for me: MAX_HEAP_SIZE at 8 GB and HEAP_NEWSIZE at 100 MB.
> Then restart node by node, and watch system.log to see if you are seeing GC pressure.
>
> On Sat, Mar 16, 2019 at 9:56 AM Sundaramoorthy, Natarajan <
> natarajan_sundaramoor...@optum.com> wrote:
>
>> So you guys are suggesting
>>
>>
>>
>> MAX_HEAP_SIZE  by 8/12/16GB
>>
>>
>>
>> And
>>
>>
>>
>> HEAP_NEWSIZE to 100 MB
>>
>>
>>
>> And
>>
>>
>>
>> heap with 50% of that as a starting point? Hw do I do this?
>>
>>
>>
>> Thanks
>>
>>
>>
>>
>>
>> *From:* Dieudonné Madishon NGAYA [mailto:dmng...@gmail.com]
>> *Sent:* Saturday, March 16, 2019 12:15 AM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: read request is slow
>>
>>
>>
>> I agree with Jon Haddad, your MAX_HEAP_SIZE is very small. You have a lot
>> of RAM (256 GB); you can start your MAX_HEAP_SIZE at 8GB and increase it if
>> necessary.
>>
>> Since you have only 1 physical core if i understood , you can set your 
>> HEAP_NEWSIZE
>> to 100 MB
>>
>>
>>
>> Best regards
>>
>> _
>>
>>
>>
>> *Dieudonne Madishon NGAYA*
>> Datastax, Cassandra Architect
>> *P: *7048580065
>> *w: *www.dmnbigdata.com
>> *E: *dmng...@dmnbigdata.com
>> *Private E: *dmng...@gmail.com
>> *A: *Charlotte,NC,28273, USA
>>
>>
>>
>>
>>
>>
>>
>> On Sat, Mar 16, 2019 at 1:07 AM Jon Haddad  wrote:
>>
>> I can't say I've ever used 100MB new gen with Cassandra, but in my
>> experience I've found small new gen to be incredibly harmful for
>> performance.  It doesn't surprise me at all that you'd hit some serious GC
>> issues.  My guess is you're filling up the new gen very quickly and
>> promoting everything in very quick cycles, leading to memory fragmentation
>> and soon after full GCs.  2GB is a tiny heap and I would never, under any
>> circumstances, run a 2GB heap in a production environment.  I'd only use
>> under 8 GB in a circle CI free tier for integration tests.
>>
>>
>>
>> I suggest you use a minimum of 8, preferably 12-16GB of total heap with
>> 50% of that as a starting point.  There's a bunch of posts floating around
>> on the topic, here's one I wrote:
>> http://thelastpickle.com/blog/2018/04/11/gc-tuning.html
>>
>>
>>
>> Jon
>>
>>
>>
>> On Sat, Mar 16, 2019 at 5:49 PM Sundaramoorthy, Natarajan <
>> natarajan_sundaramoor...@optum.com> wrote:
>>
>> Here you go. Thanks
>>
>> - name: MAX_HEAP_SIZE
>>
>>   value: 2048M
>>
>> - name: MY_POD_NAMESPACE
>>
>>   valueFrom:
>>
>> fieldRef:
>>
>>   apiVersion: v1
>>
>>   fieldPath: metadata.namespace
>>
>> - name: HEAP_NEWSIZE
>>
>>   value: 100M
>>
>>
>>
>>
>>
>>
>>
>> *From:* Dieudonné Madishon NGAYA [mailto:dmng...@gmail.com]
>> *Sent:* Friday, March 15, 2019 11:18 PM
>> *To:* user@cassandra.apache.org
>> *Subject:* Re: read request is slow
>>
>>
>>
>> Is it possible to have these parameters from cassandra-env.sh if they are
>> set:
>>
>> MAX_HEAP_SIZE and HEAP_NEWSIZE
>>
>>
>>
>> Best regards
>>
>> ___

Re: read request is slow

2019-03-15 Thread Jon Haddad
1. What was the read request?  Are you fetching a single row, a million,
something else?
2. What are your GC settings?
3. What's the hardware in use?  What resources have been allocated to each
instance?
4. Did you see this issue after a single request or is the cluster under
heavy load?

If you're going to share a config it's much easier to read as an actual
text file rather than a double spaced paste into the ML.  In the future if
you could share a link to the yaml you might get more eyes on it.

Jon

On Sat, Mar 16, 2019 at 3:57 PM Sundaramoorthy, Natarajan <
natarajan_sundaramoor...@optum.com> wrote:

> 3 pod deployed in openshift. Read request timed out due to GC collection.
> Can you please look at below parameters and value to see if anything is out
> of place? Thanks
>
>
>
>
>
> cat cassandra.yaml
>
>
>
> num_tokens: 256
>
>
>
>
>
>
>
> hinted_handoff_enabled: true
>
>
>
> hinted_handoff_throttle_in_kb: 1024
>
>
>
> max_hints_delivery_threads: 2
>
>
>
> hints_directory: /cassandra_data/hints
>
>
>
> hints_flush_period_in_ms: 1
>
>
>
> max_hints_file_size_in_mb: 128
>
>
>
>
>
> batchlog_replay_throttle_in_kb: 1024
>
>
>
> authenticator: PasswordAuthenticator
>
>
>
> authorizer: AllowAllAuthorizer
>
>
>
> role_manager: CassandraRoleManager
>
>
>
> roles_validity_in_ms: 2000
>
>
>
>
>
> permissions_validity_in_ms: 2000
>
>
>
>
>
>
>
>
>
> partitioner: org.apache.cassandra.dht.Murmur3Partitioner
>
>
>
> data_file_directories:
>
> - /cassandra_data/data
>
>
>
> commitlog_directory: /cassandra_data/commitlog
>
>
>
> disk_failure_policy: stop
>
>
>
> commit_failure_policy: stop
>
>
>
> key_cache_size_in_mb:
>
>
>
> key_cache_save_period: 14400
>
>
>
>
>
>
>
> row_cache_size_in_mb: 0
>
>
>
> row_cache_save_period: 0
>
>
>
>
>
> counter_cache_size_in_mb:
>
>
>
> counter_cache_save_period: 7200
>
>
>
>
>
> saved_caches_directory: /cassandra_data/saved_caches
>
>
>
> commitlog_sync: periodic
>
> commitlog_sync_period_in_ms: 1
>
>
>
> commitlog_segment_size_in_mb: 32
>
>
>
>
>
> seed_provider:
>
> - class_name: org.apache.cassandra.locator.SimpleSeedProvider
>
>   parameters:
>
>   - seeds:
> "cassandra-0.cassandra.ihr-ei.svc.cluster.local,cassandra-1.cassandra.ihr-ei.svc.cluster.local"
>
>
>
> concurrent_reads: 32
>
> concurrent_writes: 32
>
> concurrent_counter_writes: 32
>
>
>
> concurrent_materialized_view_writes: 32
>
>
>
>
>
>
>
>
>
> disk_optimization_strategy: ssd
>
>
>
>
>
>
>
> memtable_allocation_type: heap_buffers
>
>
>
> commitlog_total_space_in_mb: 2048
>
>
>
>
>
> index_summary_capacity_in_mb:
>
>
>
> index_summary_resize_interval_in_minutes: 60
>
>
>
> trickle_fsync: false
>
> trickle_fsync_interval_in_kb: 10240
>
>
>
> storage_port: 7000
>
>
>
> ssl_storage_port: 7001
>
>
>
> listen_address: 10.130.7.245
>
>
>
> broadcast_address: 10.130.7.245
>
>
>
>
>
>
>
> start_native_transport: true
>
> native_transport_port: 9042
>
>
>
>
>
>
>
> start_rpc: true
>
>
>
> rpc_address: 0.0.0.0
>
>
>
> rpc_port: 9160
>
>
>
> broadcast_rpc_address: 10.130.7.245
>
>
>
> rpc_keepalive: true
>
>
>
> rpc_server_type: sync
>
>
>
>
>
>
>
>
>
> thrift_framed_transport_size_in_mb: 15
>
>
>
> incremental_backups: false
>
>
>
> snapshot_before_compaction: false
>
>
>
> auto_snapshot: true
>
>
>
> tombstone_warn_threshold: 1000
>
> tombstone_failure_threshold: 10
>
>
>
> column_index_size_in_kb: 64
>
>
>
>
>
> batch_size_warn_threshold_in_kb: 5
>
>
>
> batch_size_fail_threshold_in_kb: 50
>
>
>
>
>
> compaction_throughput_mb_per_sec: 16
>
>
>
> compaction_large_partition_warning_threshold_mb: 100
>
>
>
> sstable_preemptive_open_interval_in_mb: 50
>
>
>
>
>
>
>
> read_request_timeout_in_ms: 5
>
> range_request_timeout_in_ms: 10
>
> write_request_timeout_in_ms: 2
>
> counter_write_request_timeout_in_ms: 5000
>
> cas_contention_timeout_in_ms: 1000
>
> truncate_request_timeout_in_ms: 6
>
> request_timeout_in_ms: 10
>
>
>
> cross_node_timeout: false
>
>
>
>
>
> phi_convict_threshold: 12
>
>
>
> endpoint_snitch: GossipingPropertyFileSnitch
>
>
>
> dynamic_snitch_update_interval_in_ms: 100
>
> dynamic_snitch_reset_interval_in_ms: 60
>
> dynamic_snitch_badness_threshold: 0.1
>
>
>
> request_scheduler: org.apache.cassandra.scheduler.NoScheduler
>
>
>
>
>
>
>
> server_encryption_options:
>
> internode_encryption: none
>
> keystore: conf/.keystore
>
> truststore: conf/.truststore
>
>
>
> client_encryption_options:
>
> enabled: false
>
> optional: false
>
> keystore: conf/.keystore
>
>
>
> internode_compression: all
>
>
>
> inter_dc_tcp_nodelay: false
>
>
>
> tracetype_query_ttl: 86400
>
> tracetype_repair_ttl: 604800
>
>
>
> gc_warn_threshold_in_ms: 1000
>
>
>
> enable_user_defined_functions: false
>
>
>
> enable_scripted_user_defined_functions: false
>
>
>
> windows_timer_interval: 1
>
>
>
>
>
> auto_bootstrap: false
>
>

JVM Tuning post

2018-04-11 Thread Jon Haddad
Hey folks,

We (The Last Pickle) have helped a lot of teams with JVM tuning over the years, and
finally managed to write some of it down.  We’re hoping the community finds it
helpful.

http://thelastpickle.com/blog/2018/04/11/gc-tuning.html 


Jon



Re: Text or....

2018-04-04 Thread Jon Haddad
Depending on the compression rate, I think it would generate less garbage on 
the Cassandra side if you compressed it client side.  Something to test out.
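
A rough sketch of the client-side variant with the Python driver (the table,
column names and payload are made up, and zlib is just one option):

import uuid, zlib
from cassandra.cluster import Cluster

# assumes: CREATE TABLE docs (doc_id uuid PRIMARY KEY, body blob)
session = Cluster(['127.0.0.1']).connect('my_keyspace')
doc_id, text = uuid.uuid4(), 'x' * 55000          # stand-in for the ~55k character payload
session.execute("INSERT INTO docs (doc_id, body) VALUES (%s, %s)",
                (doc_id, zlib.compress(text.encode('utf-8'))))
row = session.execute("SELECT body FROM docs WHERE doc_id = %s", (doc_id,)).one()
original = zlib.decompress(row.body).decode('utf-8')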


> On Apr 4, 2018, at 7:19 AM, Jeff Jirsa  wrote:
> 
> Compressing server side and validating checksums is hugely important in the 
> more frequently used versions of cassandra - so since you probably want to 
> run compression on the server anyway, I’m not sure why you’d compress it 
> twice 
> 
> -- 
> Jeff Jirsa
> 
> 
> On Apr 4, 2018, at 6:23 AM, DuyHai Doan  > wrote:
> 
>> Compressing client-side is better because it will save:
>> 
>> 1) a lot of bandwidth on the network
>> 2) a lot of Cassandra CPU because no decompression server-side
>> 3) a lot of Cassandra HEAP because the compressed blob should be relatively 
>> small (text data compress very well) compared to the raw size
>> 
>> On Wed, Apr 4, 2018 at 2:59 PM, Jeronimo de A. Barros 
>> > wrote:
>> Hi,
>> 
>> We use a pseudo file-system table where the chunks are blobs of 64 KB, and we
>> have never had any performance issues.
>> 
>> The primary-key structure is ((file-uuid), chunk-id).
>> 
>> Jero
>> 
>> On Wed, Apr 4, 2018 at 9:25 AM, shalom sagges > > wrote:
>> Hi All, 
>> 
>> A certain application is writing ~55,000 characters for a single row. Most 
>> of these characters go into one column with the "text" data type. 
>> 
>> This looks insanely large for one row. 
>> Would you suggest to change the data type from "text" to BLOB or any other 
>> option that might fit this scenario?
>> 
>> Thanks!
>> 
>> 



Backup & Restore w/ AWS Blog Post

2018-04-03 Thread Jon Haddad
Hey folks.  We (The Last Pickle) have helped a number of clients set up backup 
& restore on AWS over the last couple of years.  Alain has been working on a 
thorough blog post over the last several months to try to document pros, cons 
and techniques.  Hopefully it proves to be helpful to the community.  

http://thelastpickle.com/blog/2018/04/03/cassandra-backup-and-restore-aws-ebs.html
 


Jon

Re: nodetool repair and compact

2018-04-01 Thread Jon Haddad
You’ll find the answers to your questions (and quite a bit more) in this blog 
post from my coworker: 
http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html 


Repair doesn’t clean up tombstones, they’re only removed through compaction.  I 
advise taking care with nodetool compact, most of the time it’s not a great 
idea for a variety of reasons.  Check out the above post, if you still have 
questions, ask away.  
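
If you do need to nudge a table into purging droppable tombstones sooner, a
gentler approach than a major compaction is tuning the table's compaction
sub-properties (a sketch; the keyspace/table names and thresholds are
illustrative, and tombstones still have to be older than gc_grace_seconds):

ALTER TABLE my_ks.my_table
  WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                     'tombstone_threshold': '0.2',
                     'unchecked_tombstone_compaction': 'true'};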


> On Apr 1, 2018, at 9:41 PM, Xiangfei Ni  wrote:
> 
> Hi All,
>   I want to delete expired tombstones. Some people use nodetool repair, but
> others use compact, so I want to know which one is the correct way.
>   I have read the page below from DataStax, but it just tells us how to
> use the command; it doesn’t tell us what it exactly does:
>   https://docs.datastax.com/en/cassandra/3.0/cassandra/tools/toolsRepair.html
>
>   Could anybody tell me how to clean up tombstones and give me some
> material, including detailed instructions about the nodetool command and its
> options? A web link is also OK.
>   Thanks very much
> Best Regards,
>  
> 倪项菲/ David Ni
> 中移德电网络科技有限公司
> Virtue Intelligent Network Ltd, co.
> 
> Add: 2003,20F No.35 Luojia creative city,Luoyu Road,Wuhan,HuBei
> Mob: +86 13797007811|Tel: + 86 27 5024 2516



Re: Fast Writes to Cassandra Failing Through Python Script

2018-03-15 Thread Jon Haddad
TWCS does SizeTieredCompaction within the window, so it’s not likely to make a 
difference.  I’m +1’ing what Jeff said, 128ms memtable_flush_period_in_ms is 
almost certainly your problem, unless you’ve changed other settings and haven’t 
told us about them.  
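
For reference, that setting is per table and can be reset with a one-liner
(keyspace/table names are placeholders; 0 means memtables are only flushed when
they are full or the commit log forces it):

ALTER TABLE my_ks.my_table WITH memtable_flush_period_in_ms = 0;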

> On Mar 15, 2018, at 9:54 AM, Affan Syed  wrote:
> 
> Jeff, 
> 
> I think additionally the reason might also be that the keyspace was using
> TimeWindowCompactionStrategy with a 1 day bucket; however the writes were quite
> rapid and no automatic compaction was happening.
> 
> I would think changing strategy to SizeTiered would also solve this problem?
> 
> 
> 
> - Affan
> 
> On Thu, Mar 15, 2018 at 12:11 AM, Jeff Jirsa  > wrote:
> The problem was likely more with the fact that it can’t flush in 128ms, so you
> back up on flush.
> 
> 
> -- 
> Jeff Jirsa
> 
> 
> On Mar 14, 2018, at 12:07 PM, Faraz Mateen  > wrote:
> 
>> I was able to overcome the timeout error by setting 
>> memtable_flush_period_in_ms to 0 for all my tables. Initially it was set to 
>> 128. 
>> Now I am able to write ~4 records/min in Cassandra and the script has been 
>> running for around 12 hours now.
>> 
>> However, I am still curious why Cassandra was unable to hold the data in 
>> memory for 128 ms considering that I have 30 GB of RAM for each node.
>> 
>> On Wed, Mar 14, 2018 at 2:24 PM, Faraz Mateen wrote:
>> Thanks for the response.
>> 
>> Here is the output of "DESCRIBE" on my table
>> 
>> https://gist.github.com/farazmateen/1c88f6ae4fb0b9f1619a2a1b28ae58c4
>>  
>> 
>> I am getting two errors from the python script that I mentioned above. First 
>> one does not show any error or exception in server logs. Second error:
>> 
>> "cassandra.OperationTimedOut: errors={'10.128.1.1': 'Client request timeout. 
>> See Session.execute[_async](timeout)'}, last_host=10.128.1.1"
>> 
>> shows JAVA HEAP Exception in server logs. You can look at the exception here:
>> 
>> https://gist.githubusercontent.com/farazmateen/e7aa5749f963ad2293f8be0ca1ccdc22/raw/e3fd274af32c20eb9f534849a31734dcd33745b4/JVM-HEAP-EXCEPTION.txt
>>  
>> 
>> 
>> My python code snippet can be viewed at the following link:
>> https://gist.github.com/farazmateen/02be8bb59cdb205d6a35e8e3f93e27d5
>>  
>> 
>>  
>> Here are the timeout-related arguments from /etc/cassandra/cassandra.yaml:
>> 
>> read_request_timeout_in_ms: 5000
>> range_request_timeout_in_ms: 1
>> write_request_timeout_in_ms: 1
>> counter_write_request_timeout_in_ms: 5000
>> cas_contention_timeout_in_ms: 1000
>> truncate_request_timeout_in_ms: 6
>> request_timeout_in_ms: 1
>> cross_node_timeout: false
>> 
>> 
>> On Wed, Mar 14, 2018 at 4:22 AM, Bruce Tietjen wrote:
>> The following won't address any server performance issues, but will allow 
>> your application to continue to run even if there are client or server 
>> timeouts:
>> 
>> Your python code should wrap all Cassandra statement execution calls in 
>> a try/except block to catch any errors and handle them appropriately.
>> For timeouts, you might consider re-trying the statement.
>> 
>> You may also want to consider proactively setting your client and/or 
>> server timeouts so your application sees fewer failures.
>> 
>> 
>> Any production code should include proper error handling and during initial 
>> development and testing, it may be helpful to allow your application to 
>> continue running
>> so you get a better idea of if or when different timeouts occur.
>> 
>> see:
>>cassandra.Timeout
>>cassandra.WriteTimeout
>>cassandra.ReadTimeout
>> 
>> also:
>>https://datastax.github.io/python-driver/api/cassandra.html 
>> 
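
A minimal sketch of the pattern described above with the Python driver, using the exception classes just listed; the contact point, keyspace, and retry/backoff policy are placeholders, and only statements that are safe to re-execute should be retried this way:

    import time
    from cassandra import OperationTimedOut, ReadTimeout, WriteTimeout
    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect('my_ks')   # assumed contact point/keyspace

    def execute_with_retry(statement, params=None, attempts=3, timeout=30.0):
        """Run a statement, retrying a few times on client or server timeouts.

        Only use this for idempotent statements; blindly retrying other
        writes can apply them more than once.
        """
        for attempt in range(1, attempts + 1):
            try:
                # The per-call timeout overrides Session.default_timeout.
                return session.execute(statement, params, timeout=timeout)
            except (OperationTimedOut, ReadTimeout, WriteTimeout):
                if attempt == attempts:
                    raise
                time.sleep(attempt)   # crude linear backoff before retrying

    # Example: rows = execute_with_retry("SELECT release_version FROM system.local")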
>> 
>> 
>> 
>> 
>> 
>> On Tue, Mar 13, 2018 at 5:17 PM, Goutham reddy wrote:
>> Faraz,
>> Can you share your code snippet, how you are trying to save the  entity 
>> objects into cassandra.
>> 
>> Thanks and Regards,
>> Goutham Reddy Aenugu.
>> 
>> Regards
>> Goutham Reddy
>> 
>> On Tue, Mar 13, 2018 at 3:42 PM, Faraz Mateen wrote:
>> Hi everyone,
>> 
>> I seem to have hit a problem in which writing to cassandra through a python 
>> script fails and also occasionally causes cassandra node to crash. Here are 
>> 

Re: What versions should the documentation support now?

2018-03-12 Thread Jon Haddad
Docs for 3.0 go in the 3.0 branch.

I’ve never heard of anyone shipping docs for multiple versions, I don’t know 
why we’d do that.  You can get the docs for any version you need by downloading 
C*, the docs are included.  I’m a firm -1 on changing that process.

Jon

> On Mar 12, 2018, at 9:19 AM, Kenneth Brotman  
> wrote:
> 
> It seems like the documentation that should be in the trunk for version 3.0, 
> should include information for users of version 3.0 and 2.1; the 
> documentation that should in 4.0 (when its released), should include 
> information for users of 4.0 and at least one previous version, etc. 
>  
> How about if we do it that way?
>  
> Kenneth Brotman
>  
> From: Jonathan Haddad [mailto:j...@jonhaddad.com] 
> Sent: Monday, March 12, 2018 9:10 AM
> To: user@cassandra.apache.org
> Subject: Re: What versions should the documentation support now?
>  
> Right now they can’t.
> On Mon, Mar 12, 2018 at 9:03 AM Kenneth Brotman  > wrote:
>> I see how that makes sense Jon but how does a user then select the 
>> documentation for the version they are running on the Apache Cassandra web 
>> site?
>>  
>> Kenneth Brotman
>>  
>> From: Jonathan Haddad [mailto:j...@jonhaddad.com 
>> ] 
>> Sent: Monday, March 12, 2018 8:40 AM
>> 
>> To: user@cassandra.apache.org 
>> Subject: Re: What versions should the documentation support now?
>>  
>> The docs are in tree, meaning they are versioned, and should be written for 
>> the version they correspond to. Trunk docs should reflect the current state 
>> of trunk, and shouldn’t have caveats for other versions. 
>> On Mon, Mar 12, 2018 at 8:15 AM Kenneth Brotman > > wrote:
>>> If we use DataStax’s example, we would have instructions for v3.0 and v2.1. 
>>>  How’s that?  
>>>  
>>> We should also have instructions for the cloud platforms like AWS, but how 
>>> do you do that and stay vendor neutral?
>>>  
>>> Kenneth Brotman
>>>  
>>> From: Hannu Kröger [mailto:hkro...@gmail.com ] 
>>> Sent: Monday, March 12, 2018 7:40 AM
>>> To: user@cassandra.apache.org 
>>> Subject: Re: What versions should the documentation support now?
>>>  
>>> In my opinion, a good documentation should somehow include version specific 
>>> pieces of information. Whether it is nodetool command that came in certain 
>>> version or parameter for something or something else.
>>>  
>>> That would very useful. It’s confusing if I see documentation talking about 
>>> 4.0 specifics and I try to find that in my 3.11.x
>>>  
>>> Hannu
>>>  
>>> 
>>> On 12 Mar 2018, at 16:38, Kenneth Brotman >> > wrote:
>>>  
>>> I’m unclear what versions are most popular right now? What version are you 
>>> running?
>>>  
>>> What version should still be supported in the documentation?  For example, 
>>> I’m turning my attention back to writing a section on adding a data center. 
>>>  What versions should I support in that information?
>>>  
>>> I’m working on it right now.  Thanks,
>>>  
>>> Kenneth Brotman



Re: Adding disk to operating C*

2018-03-09 Thread Jon Haddad
I agree with Jeff - I usually advise teams to cap their density around 3TB, 
especially with TWCS.  Read heavy workloads tend to use smaller datasets and 
ring size ends up being a function of performance tuning.

Since 2.2 bootstrap can now be resumed, which helps quite a bit with the 
streaming problem, see CASSANDRA-8838.

Jon


> On Mar 9, 2018, at 7:39 AM, Jeff Jirsa  wrote:
> 
> 1.5 TB sounds very very conservative - 3-4T is where I set the limit at past 
> jobs. Have heard of people doing twice that (6-8T). 
> 
> -- 
> Jeff Jirsa
> 
> 
> On Mar 8, 2018, at 11:09 PM, Niclas Hedhman  > wrote:
> 
>> I am curious about the side comment; "Depending on your usecase you may not
>> want to have a data density over 1.5 TB per node."
>> 
>> Why is that? I am planning much bigger than that, and now you give me
>> pause...
>> 
>> 
>> Cheers
>> Niclas
>> 
>> On Wed, Mar 7, 2018 at 6:59 PM, Rahul Singh > > wrote:
>> Are you putting both the commitlogs and the Sstables on the SSDs? Consider 
>> moving your snapshots often if that’s also taking up space. You may be able to 
>> save some space before you add drives.
>> 
>> You should be able to add these new drives and mount them without an issue. 
>> Try to avoid different number of data dirs across nodes. It makes automation 
>> of operational processes a little harder.
>> 
>> As an aside, Depending on your usecase you may not want to have a data 
>> density over 1.5 TB per node.
>> 
>> --
>> Rahul Singh
>> rahul.si...@anant.us 
>> 
>> Anant Corporation
>> 
>> On Mar 7, 2018, 1:26 AM -0500, Eunsu Kim > >, wrote:
>>> Hello,
>>> 
>>> I use 5 nodes to create a cluster of Cassandra. (SSD 1TB)
>>> 
>>> I'm trying to mount an additional disk(SSD 1TB) on each node because each 
>>> disk usage growth rate is higher than I expected. Then I will add the 
>>> directory to data_file_directories in cassandra.yaml
>>> 
>>> Can I get advice from who have experienced this situation?
>>> If we go through the above steps one by one, will we be able to complete 
>>> the upgrade without losing data?
>>> The replication strategy is SimpleStrategy, RF 2.
>>> 
>>> Thank you in advance
>>> 
>> 
>> 
>> 
>> -- 
>> Niclas Hedhman, Software Developer
>> http://zest.apache.org  - New Energy for Java



Re: Filling in the blank To Do sections on the Apache Cassandra web site

2018-02-27 Thread Jon Haddad
There’s a section dedicated to contributing to Cassandra documentation in the 
docs as well: 
https://cassandra.apache.org/doc/latest/development/documentation.html 


> On Feb 27, 2018, at 9:55 AM, Kenneth Brotman <kenbrot...@yahoo.com.INVALID> 
> wrote:
> 
> I was just getting ready to install sphinx.  Cool.  
>  
> From: Jon Haddad [mailto:jonathan.had...@gmail.com] On Behalf Of Jon Haddad
> Sent: Tuesday, February 27, 2018 9:51 AM
> To: user@cassandra.apache.org
> Subject: Re: Filling in the blank To Do sections on the Apache Cassandra web 
> site
>  
> The docs have been in tree for years :)
>  
> https://github.com/apache/cassandra/tree/trunk/doc 
>  
> There’s even a docker image to build them so you don’t need to mess with 
> sphinx.  Check the README for instructions.
>  
> Jon
> 
> 
> On Feb 27, 2018, at 9:49 AM, Carl Mueller <carl.muel...@smartthings.com 
> <mailto:carl.muel...@smartthings.com>> wrote:
>  
> If there was a github for the docs, we could start posting content to it for 
> review. Not sure what the review/contribution process is on Apache. Google 
> searches on apache documentation and similar run into lots of noise from 
> actual projects.
> 
> I wouldn't mind trying to do a little doc work on the regular if there was a 
> wiki, a proven means to do collaborative docs. 
> 
>  
> On Tue, Feb 27, 2018 at 11:42 AM, Kenneth Brotman 
> <kenbrot...@yahoo.com.invalid <mailto:kenbrot...@yahoo.com.invalid>> wrote:
> It’s just content for web pages.  There isn’t a working outline or any draft 
> on any of the JIRA’s yet.  I like to keep things simple.  Did I miss 
> something?  What does it matter right now?
>  
> Thanks Carl,
>  
> Kenneth Brotman
>  
> From: Carl Mueller [mailto:carl.muel...@smartthings.com 
> <mailto:carl.muel...@smartthings.com>] 
> Sent: Tuesday, February 27, 2018 8:50 AM
> To: user@cassandra.apache.org <mailto:user@cassandra.apache.org>
> Subject: Re: Filling in the blank To Do sections on the Apache Cassandra web 
> site
>  
> so... are those pages in the code tree of github? I don't see them or a 
> directory structure under /doc. Is mirroring the documentation between the 
> apache site and a github source a big issue?
>  
> On Tue, Feb 27, 2018 at 7:50 AM, Kenneth Brotman 
> <kenbrot...@yahoo.com.invalid <mailto:kenbrot...@yahoo.com.invalid>> wrote:
> I was debating that.  Splitting it up into smaller tasks makes each one seem 
> less over-whelming.  
>  
> Kenneth Brotman
>  
> From: Josh McKenzie [mailto:jmcken...@apache.org] 
> Sent: Tuesday, February 27, 2018 5:44 AM
> To: cassandra
> Subject: Re: Filling in the blank To Do sections on the Apache Cassandra web 
> site
>  
> Might help, organizationally, to put all these efforts under a single ticket 
> of "Improve web site Documentation" and add these as sub-tasks. Should be able 
> to do that translation post-creation (i.e. in its current state) if that's 
> something that makes sense to you.
>  
> On Mon, Feb 26, 2018 at 5:24 PM, Kenneth Brotman 
> <kenbrot...@yahoo.com.invalid <mailto:kenbrot...@yahoo.com.invalid>> wrote:
> Here are the related JIRAs.  Please add content even if it’s not well formed 
> compositionally.  Myself or someone else will take it from there.
>  
> https://issues.apache.org/jira/browse/CASSANDRA-14274  The troubleshooting section of the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14273  The Bulk Loading web page on the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14272  The Backups web page on the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14271  The Hints web page in the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14270  The Read repair web page is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14269  The Data Modeling section of the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14268  The Architecture:Guarantees web page is empty
> https://i

Re: Filling in the blank To Do sections on the Apache Cassandra web site

2018-02-27 Thread Jon Haddad
The docs have been in tree for years :)

https://github.com/apache/cassandra/tree/trunk/doc 


There’s even a docker image to build them so you don’t need to mess with 
sphinx.  Check the README for instructions.

Jon

> On Feb 27, 2018, at 9:49 AM, Carl Mueller  
> wrote:
> 
> If there was a github for the docs, we could start posting content to it for 
> review. Not sure what the review/contribution process is on Apache. Google 
> searches on apache documentation and similar run into lots of noise from 
> actual projects.
> 
> I wouldn't mind trying to do a little doc work on the regular if there was a 
> wiki, a proven means to do collaborative docs. 
> 
> 
> On Tue, Feb 27, 2018 at 11:42 AM, Kenneth Brotman 
> > wrote:
> It’s just content for web pages.  There isn’t a working outline or any draft 
> on any of the JIRA’s yet.  I like to keep things simple.  Did I miss 
> something?  What does it matter right now?
> 
>  
> 
> Thanks Carl,
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Carl Mueller [mailto:carl.muel...@smartthings.com 
> ] 
> Sent: Tuesday, February 27, 2018 8:50 AM
> To: user@cassandra.apache.org 
> Subject: Re: Filling in the blank To Do sections on the Apache Cassandra web 
> site
> 
>  
> 
> so... are those pages in the code tree of github? I don't see them or a 
> directory structure under /doc. Is mirroring the documentation between the 
> apache site and a github source a big issue?
> 
>  
> 
> On Tue, Feb 27, 2018 at 7:50 AM, Kenneth Brotman 
> > wrote:
> 
> I was debating that.  Splitting it up into smaller tasks makes each one seem 
> less over-whelming. 
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Josh McKenzie [mailto:jmcken...@apache.org] 
> Sent: Tuesday, February 27, 2018 5:44 AM
> To: cassandra
> Subject: Re: Filling in the blank To Do sections on the Apache Cassandra web 
> site
> 
>  
> 
> Might help, organizationally, to put all these efforts under a single ticket 
> of "Improve web site Documentation" and add these as sub-tasks Should be able 
> to do that translation post-creation (i.e. in its current state) if that's 
> something that makes sense to you.
> 
>  
> 
> On Mon, Feb 26, 2018 at 5:24 PM, Kenneth Brotman 
> > wrote:
> 
> Here are the related JIRAs.  Please add content even if it’s not well formed 
> compositionally.  Myself or someone else will take it from there.
> 
>  
> 
> https://issues.apache.org/jira/browse/CASSANDRA-14274  The troubleshooting section of the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14273  The Bulk Loading web page on the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14272  The Backups web page on the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14271  The Hints web page in the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14270  The Read repair web page is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14269  The Data Modeling section of the web site is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14268  The Architecture:Guarantees web page is empty
> https://issues.apache.org/jira/browse/CASSANDRA-14267  The Dynamo web page on the Apache Cassandra site is missing content
> https://issues.apache.org/jira/browse/CASSANDRA-14266  The Architecture Overview web page on the Apache Cassandra site is empty
> 
>  
> 
> Thanks for pitching in 
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com.INVALID 
> ] 
> Sent: Monday, February 26, 2018 1:54 PM
> To: user@cassandra.apache.org 
> Subject: RE: Filling in the blank To Do sections on the Apache Cassandra web 
> site
> 
>  
> 
> Nice!  Thanks for the help Oliver!
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Oliver Ruebenacker [mailto:cur...@gmail.com ] 
> Sent: Sunday, February 25, 2018 7:12 AM
> To: user@cassandra.apache.org 
> Cc: dev@cassandra.apache.org 

Re: How to Parse raw CQL text?

2018-02-26 Thread Jon Haddad
Yes ideally.  I’ve been spending a bit of time in the parser the last week.  
There’s a lot of internals which are still using old terminology and are pretty 
damn confusing.  I’m doing a little investigation into exposing some of the 
information while also modernizing it.  


> On Feb 26, 2018, at 10:02 AM, Hannu Kröger  wrote:
> 
> If this is needed functionality, shouldn’t that be available as a public 
> method or something? Maybe write a patch etc. ?
> 
> Ariel Weisberg wrote on 26.2.2018 at 18.47:
> 
>> Hi,
>> 
>> I took a similar approach and it worked fine. I was able to build a tool 
>> that parsed production query logs.
>> 
>> I used a helper method that would just grab a private field out of an object 
>> by name using reflection.
>> 
>> Ariel
>> 
>> On Sun, Feb 25, 2018, at 11:58 PM, Jonathan Haddad wrote:
>>> I had to do something similar recently.  Take a look at 
>>> org.apache.cassandra.cql3.QueryProcessor.parseStatement().  I've got some 
>>> sample code here [1] as well as a blog post [2] that explains how to access 
>>> the private variables, since there's no access provided.  It wasn't really 
>>> designed to be used as a library, so YMMV with future changes.  
>>> 
>>> [1] 
>>> https://github.com/rustyrazorblade/rustyrazorblade-examples/blob/master/privatevaraccess/src/main/kotlin/com/rustyrazorblade/privatevaraccess/CreateTableParser.kt
>>>  
>>> 
>>> [2] 
>>> http://rustyrazorblade.com/post/2018/2018-02-25-accessing-private-variables-in-jvm/
>>>  
>>> 
>>> 
>>> On Mon, Feb 5, 2018 at 2:27 PM Kant Kodali >> > wrote:
>>> I just did some trial and error. Looks like this would work
>>> 
>>> public class Test {
>>> 
>>>     public static void main(String[] args) throws Exception {
>>> 
>>>         String stmt = "create table if not exists test_keyspace.my_table (field1 text, field2 int, field3 set, field4 map, primary key (field1) );";
>>> 
>>>         ANTLRStringStream stringStream = new ANTLRStringStream(stmt);
>>>         CqlLexer cqlLexer = new CqlLexer(stringStream);
>>>         CommonTokenStream token = new CommonTokenStream(cqlLexer);
>>>         CqlParser parser = new CqlParser(token);
>>>         ParsedStatement query = parser.cqlStatement();
>>> 
>>>         if (query.getClass().getDeclaringClass() == CreateTableStatement.class) {
>>>             CreateTableStatement.RawStatement cts = (CreateTableStatement.RawStatement) query;
>>>             CFMetaData
>>>                 .compile(stmt, cts.keyspace())
>>>                 .getColumnMetadata()
>>>                 .values()
>>>                 .stream()
>>>                 .forEach(cd -> System.out.println(cd));
>>>         }
>>>     }
>>> }
>>> 
>>> On Mon, Feb 5, 2018 at 2:13 PM, Kant Kodali >> > wrote:
>>> Hi Anant,
>>> 
>>> I just have a CQL create table statement as a string and I want to extract all 
>>> the parts like tableName, KeySpaceName, regular Columns, partitionKey, 
>>> ClusteringKey, Clustering Order and so on. That's really it!
>>> 
>>> Thanks!
>>> 
>>> On Mon, Feb 5, 2018 at 1:50 PM, Rahul Singh >> > wrote:
>>> I think I understand what you are trying to do … but what is your goal? 
>>> What do you mean “use it for different” queries… Maybe you want to do an 
>>> event and have an event processor? Seems like you are trying to basically 
>>> by pass that pattern and parse a query and split it into several actions? 
>>> 
>>> Did you look into this unit test folder? 
>>> 
>>> https://github.com/apache/cassandra/blob/trunk/test/unit/org/apache/cassandra/cql3/CQLTester.java
>>>  
>>> 
>>> 
>>> --
>>> Rahul Singh
>>> rahul.si...@anant.us 
>>> 
>>> Anant Corporation
>>> 
>>> On Feb 5, 2018, 4:06 PM -0500, Kant Kodali >> >, wrote:
>>> 
 Hi All,
 
 I have a need where I get a raw CQL create table statement as a String and 
 I need to parse the keyspace, tablename, columns and so on..so I can use 
 it for various queries and send it to C*. I used the example below from 
 this link . I get the 
 following error.  And I thought maybe someone in this mailing list will be 
 more familiar with internals.  
 
 Exception in thread "main" 
 

Re: Gathering / Curating / Organizing Cassandra Best Practices & Patterns

2018-02-24 Thread Jon Haddad
DataStax academy is great but no, no work needs to be or should be aligned with 
it.  Datastax is an independent company trying to make a profit, they could 
yank their docs at any time.  There’s a reason why we started doing the docs 
in-tree, there was too much of a reliance on DS documentation.

DataStax isn’t Cassandra.

> On Feb 24, 2018, at 10:42 AM, Kenneth Brotman  
> wrote:
> 
> Any efforts described below should be aligned with, complement, enhance, fill 
> in the outstanding work of DataStax Academy. 
>  
> Kenneth Brotman
>  
> From: Kenneth Brotman [mailto:kenbrot...@yahoo.com 
> ] 
> Sent: Saturday, February 24, 2018 10:16 AM
> To: 'user@cassandra.apache.org '
> Subject: RE: Gathering / Curating / Organizing Cassandra Best Practices & 
> Patterns
>  
> To Rahul,
>  
> This is your official email (just from me as an individual) requesting your 
> assistance to help solve the knowledge management problem. I can appreciate 
> the work you put into the Awesome Cassandra list.  It is difficult to keep 
> everything up to date.  I’ve been there too.
>  
> The golden trophy if you want to do the absolute best thing is a full-fledged 
> professional development initiative for Cassandra.   From an instructional 
> design view, what you do is create a body of knowledge and exhaustive list of 
> competencies, some call KSA’s: knowledge, skills and abilities; then you do a 
> gap analysis to find the areas in practice where gaps exists between the 
> competencies desired and those of practitioners, then generate a mix of media 
> for difference learning styles in a structured properly sequenced series of 
> easy to work through steps complete with apperception exercises, and everyone 
> will then have a smooth path towards mastery.  It’s that easy.
>  
> So, yes let’s turn it up a few notches.
>  
> Thank you,
>  
> Kenneth Brotman
>  
> 
> --
> Rahul Singh
> rahul.si...@anant.us 
> 
> Anant Corporation
> 
> On Feb 23, 2018, 5:56 PM -0500, Carl Mueller  >, wrote:
> 
> Isn't a github markdown site about the most easiest collaborative platform 
> there is for stuff like this? I'm not saying the end product will knock 
> anyone's socks off.
>  
> On Thu, Feb 22, 2018 at 10:55 AM, Rahul Singh  > wrote:
> There’s always a reason to complain if you aren’t paying for something. 
> There’s always a reason to complain if you are paying for something. 
>  
> TLDR; If you want to help curate / organize / gather knowledge about 
> Cassandra, send me an email. I’d love to solve at least the knowledge 
> management problem. 
> 
> Complaining itself is not a solution or a step in the right direction. 
> Defining an issue helps by identifying specifically what the pain is and a 
> decision can be made to resolve or not resolve it.



Re: Is it possible / makes it sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-24 Thread Jon Haddad
We don’t have this documented *anywhere* right now, I’ve created a JIRA to 
update the site with the relevant info on this topic: 
https://issues.apache.org/jira/browse/CASSANDRA-14258 



> On Feb 24, 2018, at 7:44 AM, Jon Haddad <j...@jonhaddad.com> wrote:
> 
> You can’t migrate down that way.  The last several nodes you have up will get 
> completely overwhelmed, and you’ll be completely screwed.  Please do not give 
> advice like this unless you’ve actually gone through the process or at least 
> have an understanding of how the data will be shifted.  Adding nodes with 16 
> tokens while decommissioning the ones with 256 will be absolute hell.
> 
> You can only do this by adding a new DC and retiring the old.
> 
>> On Feb 24, 2018, at 2:26 AM, Kyrylo Lebediev <kyrylo_lebed...@epam.com 
>> <mailto:kyrylo_lebed...@epam.com>> wrote:
>> 
>> > By the way, is it possible to migrate towards to smaller token ranges? 
>> > What is the recommended way doing so?
>>  - Didn't see this question answered. I think, be easiest way to do this is 
>> to add new C* nodes with lower vnodes (8, 16 instead of default 256) then 
>> decom old nodes with vnodes=256.
>> 
>> Thanks, guys, for shedding some light on this Java multithread-related 
>> scalability issue. BTW how to understand from JVM / OS metrics that number 
>> of threads for a JVM becomes a bottleneck? 
>> 
>> Also, I'd like to add a comment: the higher number of vnodes per a node the 
>> lower overall reliability of the cluster. Replicas for a token range  are 
>> placed on the nodes responsible for next+1, next+2  ranges  (not taking into 
>> account NetworkTopologyStrategy / Snitch which help but seemingly not so 
>> much expressing in terms of probabilities). The higher number of vnodes per 
>> a node, the higher probability all nodes in the cluster will become 
>> 'neighbors' in terms of token ranges.
>> It's not a trivial formula for 'reliability' of C* cluster [haven't got a 
>> chance to do math], but in general, having a bigger number of nodes in a 
>> cluster (like 100 or 200), probability of 2 or more nodes are down at the 
>> same time increases proportionally the the number of nodes.  
>> 
>> The most reliable C* setup is using initial_token instead of vnodes. 
>> But this makes manageability of the C* cluster worse [not so automated + there 
>> will be hotspots in the cluster in some cases]. 
>> 
>> Remark: for  C* cluster with RF=3 any number of nodes and 
>> initial_token/vnodes setup there is always a possibility that simultaneous 
>> unavailability of 2(or 3, depending on which CL is used) nodes will lead to 
>> unavailability of a token range ('HostUnavailable' exception). 
>> No miracles: reliability is mostly determined by RF number. 
>> 
>> Which task must be solved for large clusters: "Reliability of a cluster with 
>> NNN nodes and RF=3 shouldn't be 'tangibly' less than reliability of 3-nodes 
>> cluster with RF=3"
>> 
>> Kind Regards, 
>> Kyrill
>> From: Jürgen Albersdorfer <jalbersdor...@gmail.com 
>> <mailto:jalbersdor...@gmail.com>>
>> Sent: Tuesday, February 20, 2018 10:34:21 PM
>> To: user@cassandra.apache.org <mailto:user@cassandra.apache.org>
>> Subject: Re: Is it possible / makes it sense to limit concurrent streaming 
>> during bootstrapping new nodes?
>>  
>> Thanks Jeff,
>> your answer is really not what I expected to learn - which is again more 
>> manual doing as soon as we start really using C*. But I‘m happy to be able 
>> to learn it now and have still time to learn the neccessary Skills and ask 
>> the right questions on how to correctly drive big data with C* until we 
>> actually start using it, and I‘m glad to have People like you around caring 
>> about this questions. Thanks. This still convinces me having bet on the 
>> right horse, even when it might become a rough ride.
>> 
>> By the way, is it possible to migrate towards to smaller token ranges? What 
>> is the recommended way doing so? And which number of nodes is the typical 
>> ‚break even‘?
>> 
>> Sent from my iPhone
>> 
>> On 20.02.2018 at 21:05, Jeff Jirsa <jji...@gmail.com> wrote:
>> 
>>> The scenario you describe is the typical point where people move away from 
>>> vnodes and towards single-token-per-node (or a much smaller number of 
>>> vnodes).
>>> 
>>> The default setting puts you in a situation where virtually all hosts are 

Re: Is it possible / makes it sense to limit concurrent streaming during bootstrapping new nodes?

2018-02-24 Thread Jon Haddad
You can’t migrate down that way.  The last several nodes you have up will get 
completely overwhelmed, and you’ll be completely screwed.  Please do not give 
advice like this unless you’ve actually gone through the process or at least 
have an understanding of how the data will be shifted.  Adding nodes with 16 
tokens while decommissioning the ones with 256 will be absolute hell.

You can only do this by adding a new DC and retiring the old.

> On Feb 24, 2018, at 2:26 AM, Kyrylo Lebediev  wrote:
> 
> > By the way, is it possible to migrate towards to smaller token ranges? What 
> > is the recommended way doing so?
>  - Didn't see this question answered. I think, be easiest way to do this is 
> to add new C* nodes with lower vnodes (8, 16 instead of default 256) then 
> decom old nodes with vnodes=256.
> 
> Thanks, guys, for shedding some light on this Java multithread-related 
> scalability issue. BTW how to understand from JVM / OS metrics that number of 
> threads for a JVM becomes a bottleneck? 
> 
> Also, I'd like to add a comment: the higher number of vnodes per a node the 
> lower overall reliability of the cluster. Replicas for a token range  are 
> placed on the nodes responsible for next+1, next+2  ranges  (not taking into 
> account NetworkTopologyStrategy / Snitch which help but seemingly not so much 
> expressing in terms of probabilities). The higher number of vnodes per a 
> node, the higher probability all nodes in the cluster will become 'neighbors' 
> in terms of token ranges.
> It's not a trivial formula for 'reliability' of C* cluster [haven't got a 
> chance to do math], but in general, having a bigger number of nodes in a 
> cluster (like 100 or 200), probability of 2 or more nodes are down at the 
> same time increases proportionally the the number of nodes.  
> 
> The most reliable C* setup is using initial_token instead of vnodes. 
> But this makes manageability of the C* cluster worse [not so automated + there 
> will be hotspots in the cluster in some cases]. 
> 
> Remark: for  C* cluster with RF=3 any number of nodes and 
> initial_token/vnodes setup there is always a possibility that simultaneous 
> unavailability of 2(or 3, depending on which CL is used) nodes will lead to 
> unavailability of a token range ('HostUnavailable' exception). 
> No miracles: reliability is mostly determined by RF number. 
> 
> Which task must be solved for large clusters: "Reliability of a cluster with 
> NNN nodes and RF=3 shouldn't be 'tangibly' less than reliability of 3-nodes 
> cluster with RF=3"
> 
> Kind Regards, 
> Kyrill
> From: Jürgen Albersdorfer 
> Sent: Tuesday, February 20, 2018 10:34:21 PM
> To: user@cassandra.apache.org
> Subject: Re: Is it possible / makes it sense to limit concurrent streaming 
> during bootstrapping new nodes?
>  
> Thanks Jeff,
> your answer is really not what I expected to learn - which is again more 
> manual doing as soon as we start really using C*. But I‘m happy to be able to 
> learn it now and have still time to learn the neccessary Skills and ask the 
> right questions on how to correctly drive big data with C* until we actually 
> start using it, and I‘m glad to have People like you around caring about this 
> questions. Thanks. This still convinces me having bet on the right horse, 
> even when it might become a rough ride.
> 
> By the way, is it possible to migrate towards to smaller token ranges? What 
> is the recommended way doing so? And which number of nodes is the typical 
> ‚break even‘?
> 
> Sent from my iPhone
> 
> On 20.02.2018 at 21:05, Jeff Jirsa wrote:
> 
>> The scenario you describe is the typical point where people move away from 
>> vnodes and towards single-token-per-node (or a much smaller number of 
>> vnodes).
>> 
>> The default setting puts you in a situation where virtually all hosts are 
>> adjacent/neighbors to all others (at least until you're way into the 
>> hundreds of hosts), which means you'll stream from nearly all hosts. If you 
>> drop the number of vnodes from ~256 to ~4 or ~8 or ~16, you'll see the 
>> number of streams drop as well.
>> 
>> Many people with "large" clusters statically allocate tokens to make it 
>> predictable - if you have a single token per host, you can add multiple 
>> hosts at a time, each streaming from a small number of neighbors, without 
>> overlap.
>> 
>> It takes a bit more tooling (or manual token calculation) outside of 
>> cassandra, but works well in practice for "large" clusters.
>> 
>> 
>> 
>> 
>> On Tue, Feb 20, 2018 at 4:42 AM, Jürgen Albersdorfer 
>> > wrote:
>> Hi, I'm wondering if it is possible resp. would it make sense to limit 
>> concurrent streaming when joining a new node to cluster.
>> 
>> I'm currently operating a 15-Node C* Cluster (V 3.11.1) and joining another 
>> Node every day.
>> The 'nodetool netstats' 

Re: Initializing a multiple node cluster (multiple datacenters)

2018-02-23 Thread Jon Haddad
In my opinion and experience, this isn’t a real problem, since you define a 
list of seeds as the first few nodes you add to a cluster.  When would you add 
a node to an existing cluster and mark it as a seed?  It’s neither 
practical nor something you’d do by accident.   

> On Feb 23, 2018, at 10:17 AM, Jeff Jirsa  wrote:
> 
> 
> On Fri, Feb 23, 2018 at 10:12 AM, Oleksandr Shulgin 
> > wrote:
> On Fri, Feb 23, 2018 at 7:02 PM, Jeff Jirsa  > wrote:
> Yes, seeds don't bootstrap.  But why?  I don't think I ever seen a 
> comprehensive explanation of this.
> 
> The meaning of seed in the most common sense is "connect to this host, and 
> use it as the starting point for adding this node to the cluster".
> 
> If you specify that a joining node is the seed, the implication is that it's 
> already a member of the cluster (or, alternatively, authoritative on the 
> cluster's state).  Given that implication, why would it make sense to then 
> proceed to bootstrap? By setting it as a seed, you've told it that it already 
> knows what the cluster is. 
> 
> Well, there is certain logic in that.  However, bootstrap is about streaming 
> in the data, isn't it?  And being seed is about knowing the topology, i.e. 
> which nodes exist in the cluster.  There is actually 0 overlap of these two 
> concerns, so I don't really see why a seed node shouldn't be able to 
> bootstrap.  Would it break anything if it could, e.g. if you're explicit 
> about it and request auto_boostrap=true?
> 
> 
> I dont *think* it would break anything, but the more obvious answer is just 
> not to list the node as a seed if it needs to bootstrap.
> 
> This comes up a lot, and it's certainly one of those rough operator edges 
> that we can do better with. There's no strict requirement to have all of the 
> seeds exactly the same in a cluster, so if you need to bootstrap a new seed, 
> just join it with it not a seed, then bounce it to make it think it's a seed 
> after it's joined.
> 
> The easier answer is probably "give people a way to change seeds after 
> they're running", and it sorta exists, but it's hard to invoke intentionally. 
> We should just make that easier, and the rough edges will get a little less 
> rough.



Re: Initializing a multiple node cluster (multiple datacenters)

2018-02-22 Thread Jon Haddad
In 2.1 token allocation is random, and the distribution doesn’t work as nicely. 
 Everything else is the same.

Do not use 3.1.  Under any circumstances.  Guessing that’s a typo but I just 
want to be sure.

Jon

> On Feb 22, 2018, at 1:45 PM, Jean Carlo <jean.jeancar...@gmail.com> wrote:
> 
> Hi Jonathan
> 
> Yes I do think this is a good idea about the doc. 
> 
> About the clarification, this is still true for the 2.1 ? We are planing 
> upgrading to the 3.1 but not in the next months. We will stick for few more 
> months on the 2.1. 
> 
> I believe this is true also for the 2.1 but I would like to confirm I am 
> missing something 
> 
> 
> Saludos
> 
> Jean Carlo
> 
> "The best way to predict the future is to invent it" Alan Kay
> 
> On Thu, Feb 22, 2018 at 10:28 PM, Kenneth Brotman 
> <kenbrot...@yahoo.com.invalid <mailto:kenbrot...@yahoo.com.invalid>> wrote:
> I will heavy lift the docs for a while, do my Slender Cassandra reference 
> project and then I’ll try to find one or two areas where I can contribute 
> code to get going on that.  I have read the section on contributing before I 
> start.  I’ll self-assign the JIRA right now.
> 
>  
> 
> Kenneth Brotman
> 
>  
> 
> From: Jonathan Haddad [mailto:j...@jonhaddad.com <mailto:j...@jonhaddad.com>] 
> Sent: Thursday, February 22, 2018 1:21 PM
> To: user@cassandra.apache.org <mailto:user@cassandra.apache.org>
> Subject: Re: Initializing a multiple node cluster (multiple datacenters)
> 
>  
> 
> Kenneth, if you want to take the JIRA, feel free to self-assign it to 
> yourself and put up a pull request or patch, and I'll review.  I'd be very 
> happy to get more people involved in the docs.
> 
>  
> 
> On Thu, Feb 22, 2018 at 12:56 PM Kenneth Brotman 
> <kenbrot...@yahoo.com.invalid <mailto:kenbrot...@yahoo.com.invalid>> wrote:
> 
> That information would have saved me time too.  Thanks for making a JIRA for 
> it Jon.  Perhaps this is a good JIRA for me to begin with.
> 
>  
> 
> Kenneth Brotman 
> 
>  
> 
> From: Jon Haddad [mailto:jonathan.had...@gmail.com 
> <mailto:jonathan.had...@gmail.com>] On Behalf Of Jon Haddad
> Sent: Thursday, February 22, 2018 11:11 AM
> To: user
> Subject: Re: Initializing a multiple node cluster (multiple datacenters)
> 
>  
> 
> Great question.  Unfortunately, our OSS docs lack a step by step process on 
> how to add a DC, I’ve created a JIRA to do that: 
> https://issues.apache.org/jira/browse/CASSANDRA-14254 
>  
> 
> The datastax docs are pretty good for this though: 
> https://docs.datastax.com/en/cassandra/latest/cassandra/operations/opsAddDCToCluster.html
>  
>  
> 
> Regarding token allocation, it was random prior to 3.0.  In 3.0 and up, it is 
> calculated a little more intelligently.  in 3.11.2, which was just released, 
> CASSANDRA-13080 was backported which will help out when you add your second 
> DC.  If you go this route, you can drop your token count down to 16 and get 
> all the benefits with no drawbacks.  
> 
>  
> 
> At this point I would go straight to 3.11.2 and skip 3.0 as there were quite 
> a few improvements that make it worthwhile along the way, in my opinion.  We 
> work with several customers that are running 3.11 and are pretty happy with 
> it 
> 
>  
> 
> Yes, if there’s no data, you can initialize the cluster with auto_bootstrap: 
> true.  Be sure to change any keyspaces using SimpleStrategy to NTS first, 
> and replicate them to the new DC as well. 
> 
>  
> 
> Jon
> 
>  
> 
>  
> 
> On Feb 22, 2018, at 10:53 AM, Jean Carlo <jean.jeancar...@gmail.com 
> <mailto:jean.jeancar...@gmail.com>> wrote:
> 
>  
> 
> Hi jonathan
> 
>  
> 
> Thank you for the answer. Do you know where to look to understand why this 
> works. As i understood all the node then will chose ramdoms tokens. How can i 
> assure the correctness of the ring?
> 
>  
> 
> So as you said, under the condition that there is no data 
> in the cluster, I can initialize a multi-DC cluster without disabling auto 
> bootstrap?
> 
>  
> 
> On Feb 22, 2018 5:43 PM, "Jonathan Haddad" <j...@jonhaddad.com 
> <mailto:j...@jonhaddad.com>> wrote:
> 
> If it's a new cluster, there's no need to disable auto_bootstrap.  That 
> setting prevents the first node in the second DC from being a replica for all 
> the data in the first DC.  If there's no data in the first DC, 

Re: Initializing a multiple node cluster (multiple datacenters)

2018-02-22 Thread Jon Haddad
Great question.  Unfortunately, our OSS docs lack a step by step process on how 
to add a DC, I’ve created a JIRA to do that: 
https://issues.apache.org/jira/browse/CASSANDRA-14254 


The datastax docs are pretty good for this though: 
https://docs.datastax.com/en/cassandra/latest/cassandra/operations/opsAddDCToCluster.html
 


Regarding token allocation, it was random prior to 3.0.  In 3.0 and up, it is 
calculated a little more intelligently.  In 3.11.2, which was just released, 
CASSANDRA-13080 was backported which will help out when you add your second DC. 
 If you go this route, you can drop your token count down to 16 and get all the 
benefits with no drawbacks.  

At this point I would go straight to 3.11.2 and skip 3.0 as there were quite a 
few improvements that make it worthwhile along the way, in my opinion.  We work 
with several customers that are running 3.11 and are pretty happy with it. 

Yes, if there’s no data, you can initialize the cluster with auto_bootstrap: 
true.  Be sure to change any keyspaces using SimpleStrategy to NTS first, and 
replicate them to the new DC as well. 

Jon
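
For illustration, the keyspace change might look roughly like this; the keyspace name, datacenter names, and replication factors are placeholders:

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()   # assumed contact point

    # Switch to NetworkTopologyStrategy and declare replicas in both DCs.
    # The DC names must match what the snitch reports (nodetool status shows them).
    session.execute("""
        ALTER KEYSPACE my_ks WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'dc1': 3,
            'dc2': 3
        }
    """)

If the existing DC already holds data, the usual follow-up is a nodetool rebuild on each node in the new DC, with the existing DC as the streaming source.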


> On Feb 22, 2018, at 10:53 AM, Jean Carlo  wrote:
> 
> Hi jonathan
> 
> Thank you for the answer. Do you know where to look to understand why this 
> works? As I understood, all the nodes will then choose random tokens. How can I 
> ensure the correctness of the ring?
> 
> So as you said, under the condition that there is no data 
> in the cluster, I can initialize a multi-DC cluster without disabling auto 
> bootstrap?
> 
> On Feb 22, 2018 5:43 PM, "Jonathan Haddad"  > wrote:
> If it's a new cluster, there's no need to disable auto_bootstrap.  That 
> setting prevents the first node in the second DC from being a replica for all 
> the data in the first DC.  If there's no data in the first DC, you can skip a 
> couple steps and just leave it on.
> 
> Leave it on, and enjoy your afternoon.
> 
> Seeds don't bootstrap by the way, changing the setting on those nodes doesn't 
> do anything.
> 
> On Thu, Feb 22, 2018 at 8:36 AM Jean Carlo  > wrote:
> Hello
> 
> I would like to clarify this,
> 
> In order to initialize  a  cassandra multi dc cluster, without data. If I  
> follow the documentation datastax
> 
> https://docs.datastax.com/en/cassandra/2.1/cassandra/initialize/initializeMultipleDS.html
>  
> 
> 
> 
> It says
> auto_bootstrap: false (Add this setting only when initializing a clean node 
> with no data.)
> But I dont understand the way this works regarding to the auto_bootstraps. 
> 
> If all the machines make their own tokens in a ramdon way using 
> murmur3partitioner and vnodes , it isn't probable that two nodes will have 
> the tokens in common ?
> It is not better to bootstrap first the seeds with auto_bootstrap: false and 
> then the rest of the nodes with auto_bootstrap: true ?
> 
> 
> Thank you for the help
> 
> Jean Carlo
> 
> "The best way to predict the future is to invent it" Alan Kay
> 



Re: Cassandra Needs to Grow Up by Version Five!

2018-02-21 Thread Jon Haddad
Ken,

Maybe it’s not clear how open source projects work, so let me try to explain.  
There’s a bunch of us who either get paid by someone or volunteer on our free 
time.  The folks that get paid, (yay!) usually take direction on what the 
priorities are, and work on projects that directly affect our jobs.  That means 
that someone needs to care enough about the features you want to work on them, 
if you’re not going to do it yourself. 

Now as others have said already, please put your list of demands in JIRA, if 
someone is interested, they will work on it.  You may need to contribute a 
little more than you’ve done already, be prepared to get involved if you 
actually want to to see something get done.  Perhaps learning a little more 
about Cassandra’s internals and the people involved will reveal some of the 
design decisions and priorities of the project.  

Third, you seem to be a little obsessed with market share.  While market share 
is fun to talk about, *most* of us that are working on and contributing to 
Cassandra do so because it does actually solve a problem we have, and solves it 
reasonably well.  If some magic open source DB appears out of nowhere and does 
everything you want Cassandra to, and is bug free, keeps your data consistent, 
automatically does backups, comes with really nice cert management, ad hoc 
querying, amazing materialized views that are perfect, no caveats to secondary 
indexes, and somehow still gives you linear scalability without any mental 
overhead whatsoever then sure, people might start using it.  And that’s 
actually OK, because if that happens we’ll all be incredibly pumped out of our 
minds because we won’t have to work as hard.  If on the slim chance that 
doesn’t manifest, those of us that use Cassandra and are part of the community 
will keep working on the things we care about, iterating, and improving things. 
 Maybe someone will even take a look at your JIRA issues.  

Further filling the mailing list with your grievances will likely not help you 
progress towards your goal of a Cassandra that’s easier to use, so I encourage 
you to try to be a little more productive and try to help rather than just 
complain, which is not constructive.  I did a quick search for your name on the 
mailing list, and I’ve seen very little from you, so to everyone’s who’s been 
around for a while and trying to help you it looks like you’re just some random 
dude asking for people to work for free on the things you’re asking for, 
without offering anything back in return.

Jon


> On Feb 21, 2018, at 11:56 AM, Kenneth Brotman  
> wrote:
> 
> Josh, 
> 
> To say nothing is indifference.  If you care about your community, sometimes 
> don't you have to bring up a subject even though you know it's also 
> temporarily adding some discomfort?  
> 
> As to opening a JIRA, I've got a very specific topic to try in mind now.  An 
> easy one I'll work on and then announce.  Someone else will have to do the 
> coding.  A year from now I would probably just knock it out to make sure it's 
> as easy as I expect it to be but to be honest, as I've been saying, I'm not 
> set up to do that right now.  I've barely looked at any Cassandra code; for 
> one; everyone on this list probably codes more than I do, secondly; and 
> lastly, it's a good one for someone that wants an easy one to start with: 
> vNodes.  I've already seen too many people seeking assistance with the vNode 
> setting.
> 
> And you can expect as others have been mentioning that there should be 
> similar ones on compaction, repair and backup. 
> 
> Microsoft knows poor usability gives them an easy market to take over. And 
> they make it easy to switch.
> 
> Beginning at 4:17 in the video, it says the following:
> 
>   "You don't need to worry about replica sets, quorum or read repair.  
> You can focus on writing correct application logic."
> 
> At 4:42, it says:
>   "Hopefully this gives you a quick idea of how seamlessly you can bring 
> your existing Cassandra applications to Azure Cosmos DB.  No code changes are 
> required.  It works with your favorite Cassandra tools and drivers including 
> for example native Cassandra driver for Spark. And it takes seconds to get 
> going, and it's elastically and globally scalable."
> 
> More to come,
> 
> Kenneth Brotman
> 
> -Original Message-
> From: Josh McKenzie [mailto:jmcken...@apache.org] 
> Sent: Wednesday, February 21, 2018 8:28 AM
> To: d...@cassandra.apache.org
> Cc: User
> Subject: Re: Cassandra Needs to Grow Up by Version Five!
> 
> There's a disheartening amount of "here's where Cassandra is bad, and here's 
> what it needs to do for me for free" happening in this thread.
> 
> This is open-source software. Everyone is *strongly encouraged* to submit a 
> patch to move the needle on *any* of these things being complained about in 
> this thread.
> 
> For the Apache Way  to 

Re: Memtable flush -> SSTable: customizable or same for all compaction strategies?

2018-02-20 Thread Jon Haddad
The file format is independent of compaction.  A compaction strategy only 
selects sstables to be compacted; that’s its only job.  It could have side 
effects, like generating other files, but any decent compaction strategy will 
account for the fact that those other files don’t exist. 

I wrote a blog post a few months ago going over some of the nuance of 
compaction you might find informative: 
http://thelastpickle.com/blog/2017/03/16/compaction-nuance.html 


This is also the wrong mailing list, please direct future user questions to the 
user list.  The dev list is for development of Cassandra itself.

Jon
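
To make the "selection is the whole job" point concrete, here is a toy sketch in Python (not Cassandra's actual code) of a size-tiered style picker: it looks only at SSTable sizes and returns a group worth compacting, and nothing about the on-disk format depends on it.

    # Toy illustration only -- not Cassandra internals.  A "compaction strategy"
    # reduced to its essence: given SSTable sizes, pick a group to compact together.
    def pick_sstables_to_compact(sstable_sizes, bucket_ratio=1.5, min_threshold=4):
        """sstable_sizes: dict mapping sstable name -> size in bytes."""
        buckets = []
        for name, size in sorted(sstable_sizes.items(), key=lambda kv: kv[1]):
            for bucket in buckets:
                avg = sum(sstable_sizes[n] for n in bucket) / len(bucket)
                if avg / bucket_ratio <= size <= avg * bucket_ratio:
                    bucket.append(name)
                    break
            else:
                buckets.append([name])
        # Compact the first bucket that has enough similarly sized sstables.
        for bucket in buckets:
            if len(bucket) >= min_threshold:
                return bucket
        return []   # nothing worth compacting yet

    # Example: four ~100 MB files land in one bucket and get picked together.
    # pick_sstables_to_compact({'a': 100e6, 'b': 105e6, 'c': 98e6, 'd': 102e6, 'e': 10e6})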

> On Feb 20, 2018, at 1:10 PM, Carl Mueller  
> wrote:
> 
> When memtables/CommitLogs are flushed to disk/sstable, does the sstable go
> through sstable organization specific to each compaction strategy, or is
> the sstable creation the same for all compactionstrats and it is up to the
> compaction strategy to recompact the sstable if desired?


