Re: Validation of NetworkTopologyStrategy data center name in Cassandra 4.0

2021-08-10 Thread Jens Fischer
Thanks for providing the links, Erick, very helpful. Although it is slightly 
inconvenient for me, I now better understand the motivation.

On 10. Aug 2021, at 10:27, Erick Ramirez <erick.rami...@datastax.com> wrote:

You are correct. Cassandra no longer allows invalid DC names for 
NetworkTopologyStrategy in CREATE KEYSPACE or ALTER KEYSPACE from 4.0 
(CASSANDRA-12681). FWIW, the NEWS.txt entry documents the change for 
reference. I'm not aware of a hack that would circumvent the validation. 
Cheers!


Geschäftsführer: Oliver Koch, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908


Validation of NetworkTopologyStrategy data center name in Cassandra 4.0

2021-08-10 Thread Jens Fischer
Hi,

in Cassandra 3.11.x I was able to create keyspaces with basically arbitrary 
names for the data center. When I do this with Cassandra 4.0 I get a 
“ConfigurationException: Unrecognized strategy option {} passed to 
NetworkTopologyStrategy for keyspace”.
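For illustration, a statement along the following lines (keyspace and DC names are invented for this example) is accepted by 3.11.x but rejected by 4.0 unless the named DC is actually known to the cluster:

```cql
-- 'dc_imaginary' is a hypothetical name; on 4.0 this fails with the
-- ConfigurationException above unless a DC with that name exists
CREATE KEYSPACE test_ks
  WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'dc_imaginary': 3
  };
```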

This breaks some unit tests in our CI where we test CREATE KEYSPACE statements 
for different clusters on a single-node test instance.

The only documentation I found is an issue from ScyllaDB: 
https://github.com/scylladb/scylla/issues/7595. It seems Cassandra 4.0 added 
validation of the data center name. I verified that I can get rid of the 
error by configuring a DC in cassandra-rackdc.properties and setting 
endpoint_snitch: GossipingPropertyFileSnitch in cassandra.yaml. This, of 
course, is not very practical for unit tests because we would need to change 
Cassandra's configuration (and restart it) before any of the unit tests run. 
It is no problem for production, of course, where the cluster is configured 
accordingly.

Is there a way to disable the validations for testing purposes? Or to change 
them dynamically?

Any help is appreciated!
Jens




Re: Log Rotation of Extended Compaction Logging

2021-04-09 Thread Jens Fischer
Hi Erik,

thank you for the link, very instructive.

To summarise my understanding of your mail, the code and my experiments:

- as long as the compaction logger is running it will write into the same 
“compaction.log” file
- if a new logger is started (for example through a restart of the Cassandra 
node) the current file will be moved to “compaction-.log” and a new 
“compaction.log” file will be created
- files will never be archived (compressed) or deleted

Correct?
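If that is correct, rotation and cleanup would have to happen outside Cassandra; one option is a logrotate rule along these lines (paths and retention are assumptions, not taken from this thread):

```conf
# /etc/logrotate.d/cassandra-compaction (hypothetical path)
/var/log/cassandra/compaction*.log {
    weekly
    rotate 4
    compress
    missingok
    # copy the file and truncate it in place, so the logger's open
    # file handle keeps working without a restart
    copytruncate
}
```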

Best
Jens
Geschäftsführer: Jean-Baptiste Cornefert, Oliver Koch, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908


Log Rotation of Extended Compaction Logging

2021-04-07 Thread Jens Fischer
Hi,

Does anybody know the configuration for Extended Compaction Logging[1]? When is 
a log rotation triggered and how many files are kept?

I did some googling, found [2] and checked the configuration in 
/etc/cassandra/logback.xml, neither does mention anything about compaction 
logging.

I am using Cassandra 3.11.6.

Any help is appreciated.

Best Regards
Jens

[1]: 
https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlCreateTable.html?hl=log_all#compactSubprop__enabling-extended-compaction-logging
[2]: 
https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/configuration/configLoggingLevels.html


Gesch?ftsf?hrer: Jean-Baptiste Cornefert, Oliver Koch, Bianca Swanston
Amtsgericht Kempten/Allg?u, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908


Re: Network Bandwidth and Multi-DC replication

2020-12-16 Thread Jens Fischer
Hello Jeff,

very interesting stuff, thank you for sharing!

Indeed, I am storing time-series data. The table has 67 columns. Writing is 
done in two steps: first 43 fields (3 primary key fields and 40 data fields), 
then 27 fields (3 primary key fields and 24 data fields) in a second step, 
always one row (one timestamp) at a time. The two write steps happen within 
milliseconds of each other, i.e. in the vast majority of cases the columns 
should be consolidated in the memtable before hitting the disk. Extrapolating 
from your example I would think that this should not be the cause of the 
excessive bandwidth usage?

Best
Jens

On 15. Dec 2020, at 17:47, Jeff Jirsa <jji...@gmail.com> wrote:

There's a small amount of overhead on each packet for serialization - e.g., 
each mutation is tied to a column family (uuid) and gets serialized with sizes 
and checksums, so I guess there's a point where your updates are small enough 
that the overhead of the mutations starts being visible.


You mentioned you're storing time-series. Hard to know what your time series 
actually is, but pretending it's recording weather over time (easy throwaway 
example): if you're writing 100kb chunks of text or json at various timestamps 
(e.g. "wind speed, wind direction, low temp, high temp, precipitation volume, 
precipitation type, small craft advisory, hurricane warning"), you won't notice 
the serialized sizes or uuid overhead. But if you're setting 
location=temperature values, that's pretty small, and the overhead starts 
showing up.



On Tue, Dec 15, 2020 at 8:39 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
Hi Scott,

Thank you for your help. There was an error, or at least an ambiguity, in my 
second mail! I wrote:

I still see outgoing cross-DC traffic of ~ 2x the “write size A”

What I wanted to say was: I still see outgoing cross-DC traffic of ~2x the 
“write size A” per remote DC, or 4x the “write size A” in total.

Your response underlines that this is way more than expected. Any idea what 
could cause this or how to further debug? As mentioned I already checked for 
anti-entropy repair (not running) and read repairs (read_repair_chance and 
dclocal_read_repair_chance set to 0) and hints (no hints replay according to 
logs, hint directory empty).

Best
Jens

On 10. Dec 2020, at 01:28, Scott Hirleman <scott.hirle...@gmail.com> wrote:

2x makes sense though. If you have 3 DCs, you write locally to DC1; it gets 
replicated once in DC1, and then it gets replicated to DC2 AND DC3 at 
consistency local_one via cross-DC traffic to one of the nodes in each DC, 
then replicated in each DC to a second node via local traffic.

A write comes in to DC1 node 1; it replicates to DC1 node 2, DC2 node 1, and 
DC3 node 1. So the outgoing traffic is 2x the write size, going to each of 
DC2 and DC3. Once it gets written to DC2 node 1, it gets replicated locally 
to DC2 node 2; same for DC3 re DC3 node 2.

On Wed, Dec 2, 2020 at 9:36 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
Hi,

I checked for all the other factors mentioned - anti-entropy repair, hints, 
read repair - and I still see outgoing cross-DC traffic of ~2x the “write 
size A” (as defined below). Given Jeff's answers this is not to be expected, 
i.e. there is something wrong here. Does anybody have an idea how to debug 
this?

I define the “write size A” as follows: take the incoming traffic from all 
nodes inserting into DC1 and sum it up.

Best
Jens

On 30. Nov 2020, at 12:00, Jens Fischer <j.fisc...@sonnen.de> wrote:

Hi Jeff,

Thank you for your answer, very helpful already!

All writes are done with `LOCAL_ONE` and we have RF=2 in each data center.

To compare our examples we need to come to an agreement on what you are calling 
“write size A”. I gave two different write sizes:

I call the bandwidth for receiving the data on Node A the “base bandwidth”

This is the inbound traffic at Node A. Data to Node A is transmitted as 
Protobuf inside VPN tunnels. A very rough estimate of data size, I know. Node A 
is not a Cassandra node!


Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth

I looked at all the Cassandra nodes in DC1 and the traffic coming from Node A. 
I then summed up this traffic.
@Jeff: I assume this is closer to what you call “write size A”?

Best
Jens


On 26. Nov 2020, at 17:12, Jeff Jirsa <jji...@gmail.com> wrote:



On Nov 26, 2020, at 9:53 AM, Jens Fischer <j.fisc...@sonnen.de> wrote:

 Hi,

we run a Cassandra cluster with three DCs. We noticed that the traffic incurred 
by running the cluster is significant.

Consider the following simplified IoT scenario:

* time series data from devices in the field is received at Node A
* Node A inserts the data into DC 1
* DC 1 replicates the data within the DC and to the other two DCs

The traffic this produces is significant. The numbers below are based on 
observing the incoming and outgoing traffic on the node level.

Re: Network Bandwidth and Multi-DC replication

2020-12-15 Thread Jens Fischer
Hi Scott,

Thank you for your help. There was an error, or at least an ambiguity, in my 
second mail! I wrote:

I still see outgoing cross-DC traffic of ~ 2x the “write size A”

What I wanted to say was: I still see outgoing cross-DC traffic of ~2x the 
“write size A” per remote DC, or 4x the “write size A” in total.

Your response underlines that this is way more than expected. Any idea what 
could cause this or how to further debug? As mentioned I already checked for 
anti-entropy repair (not running) and read repairs (read_repair_chance and 
dclocal_read_repair_chance set to 0) and hints (no hints replay according to 
logs, hint directory empty).

Best
Jens

On 10. Dec 2020, at 01:28, Scott Hirleman <scott.hirle...@gmail.com> wrote:

2x makes sense though. If you have 3 DCs, you write locally to DC1; it gets 
replicated once in DC1, and then it gets replicated to DC2 AND DC3 at 
consistency local_one via cross-DC traffic to one of the nodes in each DC, 
then replicated in each DC to a second node via local traffic.

A write comes in to DC1 node 1; it replicates to DC1 node 2, DC2 node 1, and 
DC3 node 1. So the outgoing traffic is 2x the write size, going to each of 
DC2 and DC3. Once it gets written to DC2 node 1, it gets replicated locally 
to DC2 node 2; same for DC3 re DC3 node 2.

On Wed, Dec 2, 2020 at 9:36 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
Hi,

I checked for all the other factors mentioned - anti-entropy repair, hints, 
read repair - and I still see outgoing cross-DC traffic of ~2x the “write 
size A” (as defined below). Given Jeff's answers this is not to be expected, 
i.e. there is something wrong here. Does anybody have an idea how to debug 
this?

I define the “write size A” as follows: take the incoming traffic from all 
nodes inserting into DC1 and sum it up.

Best
Jens

On 30. Nov 2020, at 12:00, Jens Fischer <j.fisc...@sonnen.de> wrote:

Hi Jeff,

Thank you for your answer, very helpful already!

All writes are done with `LOCAL_ONE` and we have RF=2 in each data center.

To compare our examples we need to come to an agreement on what you are calling 
“write size A”. I gave two different write sizes:

I call the bandwidth for receiving the data on Node A the “base bandwidth”

This is the inbound traffic at Node A. Data to Node A is transmitted as 
Protobuf inside VPN tunnels. A very rough estimate of data size, I know. Node A 
is not a Cassandra node!


Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth

I looked at all the Cassandra nodes in DC1 and the traffic coming from Node A. 
I then summed up this traffic.
@Jeff: I assume this is closer to what you call “write size A”?

Best
Jens


On 26. Nov 2020, at 17:12, Jeff Jirsa <jji...@gmail.com> wrote:



On Nov 26, 2020, at 9:53 AM, Jens Fischer <j.fisc...@sonnen.de> wrote:

 Hi,

we run a Cassandra cluster with three DCs. We noticed that the traffic incurred 
by running the cluster is significant.

Consider the following simplified IoT scenario:

* time series data from devices in the field is received at Node A
* Node A inserts the data into DC 1
* DC 1 replicates the data within the DC and to the other two DCs

The traffic this produces is significant. The numbers below are based on 
observing the incoming and outgoing traffic on the node level:

* I call the bandwidth for receiving the data on Node A the “base bandwidth”
* Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth
* Replication to each of the other data centres takes 5 times the base bandwidth
* overall we see a “bandwidth amplification” of ~ 13x (3+5+5)


You didn’t specify consistency levels or replication factors so it’s hard to 
check your math.

Here’s what I’d expect

If you do RF=3 per DC and have 3 DCs, a write of size A is written to the 
cluster using coordinator C

C sends that write to replicas R1, R2, and R3 in the local DC
C sends the write to F2 and F3 - forwarders - one in each remote DC
F2 sends the write to R1-2, R2-2 in the remote DC2 and itself (F2 will be a 
replica), each replica sends an ack back to C
F3 sends the write to R1-3, R2-3 in the remote DC3 and itself (F3 will be a 
replica), each replica sends an ack back to C

You can avoid one extra write using token aware routing and making C a replica 
(R1, for example)

Given this, I don’t see how a remote DC is 5x A - it should be cross-DC/WAN 
cost A into the forwarder and 2A out of the forwarder (local traffic, 
cross-AZ/rack but not WAN), with trivial ACK cost to the original DC.
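Jeff's expected fan-out can be turned into a small back-of-the-envelope model (the function and its parameter names are mine, and it deliberately ignores ACKs, serialization overhead, and compression):

```python
def expected_traffic(rf_per_dc, num_dcs, coordinator_is_replica=False, write_size=1.0):
    """Model the write fan-out described above: the coordinator sends one copy
    to each local replica and one copy to a single forwarder per remote DC;
    each forwarder then re-sends the write to the remaining local replicas."""
    remote_dcs = num_dcs - 1
    # Token-aware routing can make the coordinator itself a replica,
    # saving one local copy.
    local_copies = rf_per_dc - 1 if coordinator_is_replica else rf_per_dc
    return {
        "coordinator_out": (local_copies + remote_dcs) * write_size,
        "wan_per_remote_dc": write_size,  # one copy crosses the WAN per remote DC
        "wan_total": remote_dcs * write_size,
        "forwarder_local_out": (rf_per_dc - 1) * write_size,
    }

# Jeff's RF=3 / 3-DC example: A into each forwarder, 2A out of it locally
jeff = expected_traffic(rf_per_dc=3, num_dcs=3)

# Jens's setup (RF=2 per DC, 3 DCs): the model predicts only 1x "write size A"
# of WAN traffic per remote DC, so the observed ~2x per DC is double expectation
jens = expected_traffic(rf_per_dc=2, num_dcs=3)
```

Under those assumptions the numbers match Jeff's description, which is what makes the observed factor look anomalous.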

If you’re seeing more than this, it may be something other than pure writes - 
anti entropy repair, hints, read repair are all possible, and would have 
different causes and fixes.

Most people who get to this level of calculation are doing so because they’re 
trying to solve a problem, and the common problem is that cross-AZ traffic in 
cloud providers is expensive at scale.

Re: Network Bandwidth and Multi-DC replication

2020-12-02 Thread Jens Fischer
Hi,

I checked for all the other factors mentioned - anti-entropy repair, hints, 
read repair - and I still see outgoing cross-DC traffic of ~2x the “write 
size A” (as defined below). Given Jeff's answers this is not to be expected, 
i.e. there is something wrong here. Does anybody have an idea how to debug 
this?

I define the “write size A” as follows: take the incoming traffic from all 
nodes inserting into DC1 and sum it up.

Best
Jens

On 30. Nov 2020, at 12:00, Jens Fischer <j.fisc...@sonnen.de> wrote:

Hi Jeff,

Thank you for your answer, very helpful already!

All writes are done with `LOCAL_ONE` and we have RF=2 in each data center.

To compare our examples we need to come to an agreement on what you are calling 
“write size A”. I gave two different write sizes:

I call the bandwidth for receiving the data on Node A the “base bandwidth”

This is the inbound traffic at Node A. Data to Node A is transmitted as 
Protobuf inside VPN tunnels. A very rough estimate of data size, I know. Node A 
is not a Cassandra node!


Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth

I looked at all the Cassandra nodes in DC1 and the traffic coming from Node A. 
I then summed up this traffic.
@Jeff: I assume this is closer to what you call “write size A”?

Best
Jens


On 26. Nov 2020, at 17:12, Jeff Jirsa <jji...@gmail.com> wrote:



On Nov 26, 2020, at 9:53 AM, Jens Fischer <j.fisc...@sonnen.de> wrote:

 Hi,

we run a Cassandra cluster with three DCs. We noticed that the traffic incurred 
by running the cluster is significant.

Consider the following simplified IoT scenario:

* time series data from devices in the field is received at Node A
* Node A inserts the data into DC 1
* DC 1 replicates the data within the DC and to the other two DCs

The traffic this produces is significant. The numbers below are based on 
observing the incoming and outgoing traffic on the node level:

* I call the bandwidth for receiving the data on Node A the “base bandwidth”
* Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth
* Replication to each of the other data centres takes 5 times the base bandwidth
* overall we see a “bandwidth amplification” of ~ 13x (3+5+5)


You didn’t specify consistency levels or replication factors so it’s hard to 
check your math.

Here’s what I’d expect

If you do RF=3 per DC and have 3 DCs, a write of size A is written to the 
cluster using coordinator C

C sends that write to replicas R1, R2, and R3 in the local DC
C sends the write to F2 and F3 - forwarders - one in each remote DC
F2 sends the write to R1-2, R2-2 in the remote DC2 and itself (F2 will be a 
replica), each replica sends an ack back to C
F3 sends the write to R1-3, R2-3 in the remote DC3 and itself (F3 will be a 
replica), each replica sends an ack back to C

You can avoid one extra write using token aware routing and making C a replica 
(R1, for example)

Given this, I don’t see how a remote DC is 5x A - it should be cross-DC/WAN 
cost A into the forwarder and 2A out of the forwarder (local traffic, 
cross-AZ/rack but not WAN), with trivial ACK cost to the original DC.

If you’re seeing more than this, it may be something other than pure writes - 
anti entropy repair, hints, read repair are all possible, and would have 
different causes and fixes.

Most people who get to this level of calculation are doing so because they’re 
trying to solve a problem, and the common problem is that cross-AZ traffic in 
cloud providers is expensive at scale. If that’s why you’re asking, compression 
is your obvious win, and reducing RF is your alternative option (3/3/3 is super 
expensive - how many dcs take writes directly and which consistency level are 
you using? What’s the point of having 9 copies of the data? Would 1 copy per dc 
be enough if you’re doing global quorum? Would 2 copies in the cold DCs be 
enough if you’re only reading / writing from one DC?).

My questions:

1. Would you consider these factors expected behaviour?

13 seems high. 9 seems more correct unless you’re double counting sending and 
receiving.

2. Are there ways to reduce the traffic through configuration?

Compression, reducing RF, maybe mitigation with longer timeouts to avoid double 
sending hints.


A few additional notes on the setup:

* we use NetworkTopologyStrategy for replication and cassandra-rackdc.properties 
to configure the GossipingPropertyFileSnitch
* internode_compression is set to dc
* inter_dc_tcp_nodelay is set to false

Any help is highly appreciated!

Best Regards
Jens

Geschäftsführer: Oliver Koch (CEO), Jean-Baptiste Cornefert, Christoph 
Ostermann, Hermann Schweizer, Bianca Swanston
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908



Re: Network Bandwidth and Multi-DC replication

2020-11-30 Thread Jens Fischer
Hi Jeff,

Thank you for your answer, very helpful already!

All writes are done with `LOCAL_ONE` and we have RF=2 in each data center.

To compare our examples we need to come to an agreement on what you are calling 
“write size A”. I gave two different write sizes:

I call the bandwidth for receiving the data on Node A the “base bandwidth”

This is the inbound traffic at Node A. Data to Node A is transmitted as 
Protobuf inside VPN tunnels. A very rough estimate of data size, I know. Node A 
is not a Cassandra node!


Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth

I looked at all the Cassandra nodes in DC1 and the traffic coming from Node A. 
I then summed up this traffic.
@Jeff: I assume this is closer to what you call “write size A”?

Best
Jens


On 26. Nov 2020, at 17:12, Jeff Jirsa <jji...@gmail.com> wrote:



On Nov 26, 2020, at 9:53 AM, Jens Fischer <j.fisc...@sonnen.de> wrote:

 Hi,

we run a Cassandra cluster with three DCs. We noticed that the traffic incurred 
by running the cluster is significant.

Consider the following simplified IoT scenario:

* time series data from devices in the field is received at Node A
* Node A inserts the data into DC 1
* DC 1 replicates the data within the DC and to the other two DCs

The traffic this produces is significant. The numbers below are based on 
observing the incoming and outgoing traffic on the node level:

* I call the bandwidth for receiving the data on Node A the “base bandwidth”
* Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth
* Replication to each of the other data centres takes 5 times the base bandwidth
* overall we see a “bandwidth amplification” of ~ 13x (3+5+5)


You didn’t specify consistency levels or replication factors so it’s hard to 
check your math.

Here’s what I’d expect

If you do RF=3 per DC and have 3 DCs, a write of size A is written to the 
cluster using coordinator C

C sends that write to replicas R1, R2, and R3 in the local DC
C sends the write to F2 and F3 - forwarders - one in each remote DC
F2 sends the write to R1-2, R2-2 in the remote DC2 and itself (F2 will be a 
replica), each replica sends an ack back to C
F3 sends the write to R1-3, R2-3 in the remote DC3 and itself (F3 will be a 
replica), each replica sends an ack back to C

You can avoid one extra write using token aware routing and making C a replica 
(R1, for example)

Given this, I don’t see how a remote DC is 5x A - it should be cross-DC/WAN 
cost A into the forwarder and 2A out of the forwarder (local traffic, 
cross-AZ/rack but not WAN), with trivial ACK cost to the original DC.

If you’re seeing more than this, it may be something other than pure writes - 
anti entropy repair, hints, read repair are all possible, and would have 
different causes and fixes.

Most people who get to this level of calculation are doing so because they’re 
trying to solve a problem, and the common problem is that cross-AZ traffic in 
cloud providers is expensive at scale. If that’s why you’re asking, compression 
is your obvious win, and reducing RF is your alternative option (3/3/3 is super 
expensive - how many dcs take writes directly and which consistency level are 
you using? What’s the point of having 9 copies of the data? Would 1 copy per dc 
be enough if you’re doing global quorum? Would 2 copies in the cold DCs be 
enough if you’re only reading / writing from one DC?).

My questions:

1. Would you consider these factors expected behaviour?

13 seems high. 9 seems more correct unless you’re double counting sending and 
receiving.

2. Are there ways to reduce the traffic through configuration?

Compression, reducing RF, maybe mitigation with longer timeouts to avoid double 
sending hints.


A few additional notes on the setup:

* we use NetworkTopologyStrategy for replication and cassandra-rackdc.properties 
to configure the GossipingPropertyFileSnitch
* internode_compression is set to dc
* inter_dc_tcp_nodelay is set to false

Any help is highly appreciated!

Best Regards
Jens





Network Bandwidth and Multi-DC replication

2020-11-26 Thread Jens Fischer
Hi,

we run a Cassandra cluster with three DCs. We noticed that the traffic incurred 
by running the cluster is significant.

Consider the following simplified IoT scenario:

* time series data from devices in the field is received at Node A
* Node A inserts the data into DC 1
* DC 1 replicates the data within the DC and to the other two DCs

The traffic this produces is significant. The numbers below are based on 
observing the incoming and outgoing traffic on the node level:

* I call the bandwidth for receiving the data on Node A the “base bandwidth”
* Inserting into Cassandra (in one DC) takes 2-3 times the base bandwidth
* Replication to each of the other data centres takes 5 times the base bandwidth
* overall we see a “bandwidth amplification” of ~ 13x (3+5+5)

My questions:

1. Would you consider these factors expected behaviour?
2. Are there ways to reduce the traffic through configuration?

A few additional notes on the setup:

* we use NetworkTopologyStrategy for replication and cassandra-rackdc.properties 
to configure the GossipingPropertyFileSnitch
* internode_compression is set to dc
* inter_dc_tcp_nodelay is set to false

Any help is highly appreciated!

Best Regards
Jens



Re: Multi-DC replication and hinted handoff

2019-04-09 Thread Jens Fischer
Hi,

an update: I am pretty sure it is a problem with insufficient bandwidth. I 
can’t be sure because Cassandra does not seem to provide debug information on 
hint creation (only when replaying hints). When the bandwidth issue is solved I 
will try to reproduce the accumulation of hints by artificially limiting the 
bandwidth.

Best
Jens

On 3. Apr 2019, at 01:48, Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:

Hi Jens,

I am reading Cassandra: The Definitive Guide; chapter 9, Reading and Writing 
Data, has a section called The Cassandra Write Path with this sentence in it:

If a replica does not respond within the timeout, it is presumed to be down and 
a hint is stored for the write.

So your node might actually be fine eventually, but it just cannot cope with 
the load and replies too late, after the coordinator already has sufficient 
replies from other replicas. So it makes a hint for that write and for that 
node. I am not sure how this is related to turning off handoffs completely. I 
can do some tests locally, if time allows, to investigate various scenarios. 
There might be some subtle differences.

On Wed, 3 Apr 2019 at 07:19, Jens Fischer <j.fisc...@sonnen.de> wrote:
Yes, Apache Cassandra 3.11.2 (no DSE).

On 2. Apr 2019, at 19:40, sankalp kohli <kohlisank...@gmail.com> wrote:

Are you using OSS C*?

On Fri, Mar 29, 2019 at 1:49 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
Hi,

I have a Cassandra setup with multiple data centres. The vast majority of 
writes are LOCAL_ONE writes to data center DC-A. One node (let's call this 
node A1) in DC-A has accumulated large amounts of hint files (~100 GB). In 
the logs of this node I see lots of messages like the following:

INFO  [HintsDispatcher:26] 2019-03-28 01:49:25,217 
HintsDispatchExecutor.java:289 - Finished hinted handoff of file 
db485ac6-8acd-4241-9e21-7a2b540459de-1553419324363-1.hints to endpoint 
/10.10.2.55: db485ac6-8acd-4241-9e21-7a2b540459de

The node 10.10.2.55 is in DC-B; let's call this node B1. There is no 
indication whatsoever that B1 was down: nothing in our monitoring, nothing in 
the logs of B1, nothing in the logs of A1. Are there any other situations 
where hints to B1 are stored at A1, other than A1's failure detection marking 
B1 as down? For example, could the reason for the hints be that B1 is 
overloaded and cannot handle the intake from A1? Or that the network 
connection between DC-A and DC-B is too slow?

While researching this I also found the following information on Stack Overflow 
from Ben Slater regarding hints and multi-dc replication:

Another factor here is the consistency level you are using - a LOCAL_* 
consistency level will only require writes to be written to the local DC for 
the operation to be considered a success (and hints will be stored for 
replication to the other DC).
(…)
The hints are the records of writes that have been made in one DC that are not 
yet replicated to the other DC (or even nodes within a DC). I think your 
options to avoid them are: (1) write with ALL or QUORUM (not LOCAL_*) 
consistency - this will slow down your writes but will ensure writes go into 
both DCs before the op completes (2) Don't replicate the data to the second DC 
(by setting the replication factor to 0 for the second DC in the keyspace 
definition) (3) Increase the capacity of the second DC so it can keep up with 
the writes (4) Slow down your writes so the second DC can keep up.

Source: https://stackoverflow.com/a/37382726

This reads like hints are used for “normal” (async) replication between data 
centres, i.e. hints could show up without any nodes being down whatsoever. This 
could explain what I am seeing. Does anyone know more about this? Does that 
mean I will see hints even if I disable hinted handoff?

Any pointers or help are greatly appreciated!

Thanks in advance
Jens


[https://img.sonnen.de/TSEE2019_Banner_sonnenGmbH_de_1.jpg]

Geschäftsführer: Christoph Ostermann (CEO), Oliver Koch, Steffen Schneider, 
Hermann Schweizer.
Amtsgericht Kempten/Allgäu, Registernummer: 10655, Steuernummer 127/137/50792, 
USt.-IdNr. DE272208908






Re: Multi-DC replication and hinted handoff

2019-04-02 Thread Jens Fischer
Yes, Apache Cassandra 3.11.2 (no DSE).

On 2. Apr 2019, at 19:40, sankalp kohli <kohlisank...@gmail.com> wrote:

Are you using OSS C*?

On Fri, Mar 29, 2019 at 1:49 AM Jens Fischer <j.fisc...@sonnen.de> wrote:
Hi,

I have a Cassandra setup with multiple data centres. The vast majority of 
writes are LOCAL_ONE writes to data center DC-A. One node (let's call this 
node A1) in DC-A has accumulated large amounts of hint files (~100 GB). In 
the logs of this node I see lots of messages like the following:

INFO  [HintsDispatcher:26] 2019-03-28 01:49:25,217 
HintsDispatchExecutor.java:289 - Finished hinted handoff of file 
db485ac6-8acd-4241-9e21-7a2b540459de-1553419324363-1.hints to endpoint 
/10.10.2.55: db485ac6-8acd-4241-9e21-7a2b540459de

The node 10.10.2.55 is in DC-B; let's call this node B1. There is no 
indication whatsoever that B1 was down: nothing in our monitoring, nothing in 
the logs of B1, nothing in the logs of A1. Are there any other situations 
where hints to B1 are stored at A1, other than A1's failure detection marking 
B1 as down? For example, could the reason for the hints be that B1 is 
overloaded and cannot handle the intake from A1? Or that the network 
connection between DC-A and DC-B is too slow?

While researching this I also found the following information on Stack Overflow 
from Ben Slater regarding hints and multi-dc replication:

Another factor here is the consistency level you are using - a LOCAL_* 
consistency level will only require writes to be written to the local DC for 
the operation to be considered a success (and hints will be stored for 
replication to the other DC).
(…)
The hints are the records of writes that have been made in one DC that are not 
yet replicated to the other DC (or even nodes within a DC). I think your 
options to avoid them are: (1) write with ALL or QUORUM (not LOCAL_*) 
consistency - this will slow down your writes but will ensure writes go into 
both DCs before the op completes (2) Don't replicate the data to the second DC 
(by setting the replication factor to 0 for the second DC in the keyspace 
definition) (3) Increase the capacity of the second DC so it can keep up with 
the writes (4) Slow down your writes so the second DC can keep up.

Source: https://stackoverflow.com/a/37382726

This reads like hints are used for “normal” (async) replication between data 
centres, i.e. hints could show up without any nodes being down whatsoever. This 
could explain what I am seeing. Does anyone know more about this? Does that 
mean I will see hints even if I disable hinted handoff?

Any pointers or help are greatly appreciated!

Thanks in advance
Jens






Multi-DC replication and hinted handoff

2019-03-29 Thread Jens Fischer
Hi,

I have a Cassandra setup with multiple data centres. The vast majority of 
writes are LOCAL_ONE writes to data center DC-A. One node (let's call this 
node A1) in DC-A has accumulated large amounts of hint files (~100 GB). In 
the logs of this node I see lots of messages like the following:

INFO  [HintsDispatcher:26] 2019-03-28 01:49:25,217 
HintsDispatchExecutor.java:289 - Finished hinted handoff of file 
db485ac6-8acd-4241-9e21-7a2b540459de-1553419324363-1.hints to endpoint 
/10.10.2.55: db485ac6-8acd-4241-9e21-7a2b540459de

The node 10.10.2.55 is in DC-B; let's call this node B1. There is no 
indication whatsoever that B1 was down: nothing in our monitoring, nothing in 
the logs of B1, nothing in the logs of A1. Are there any other situations 
where hints to B1 are stored at A1, other than A1's failure detection marking 
B1 as down? For example, could the reason for the hints be that B1 is 
overloaded and cannot handle the intake from A1? Or that the network 
connection between DC-A and DC-B is too slow?

While researching this I also found the following information on Stack Overflow 
from Ben Slater regarding hints and multi-dc replication:

Another factor here is the consistency level you are using - a LOCAL_* 
consistency level will only require writes to be written to the local DC for 
the operation to be considered a success (and hints will be stored for 
replication to the other DC).
(…)
The hints are the records of writes that have been made in one DC that are not 
yet replicated to the other DC (or even nodes within a DC). I think your 
options to avoid them are: (1) write with ALL or QUORUM (not LOCAL_*) 
consistency - this will slow down your writes but will ensure writes go into 
both DCs before the op completes (2) Don't replicate the data to the second DC 
(by setting the replication factor to 0 for the second DC in the keyspace 
definition) (3) Increase the capacity of the second DC so it can keep up with 
the writes (4) Slow down your writes so the second DC can keep up.

Source: https://stackoverflow.com/a/37382726

This reads like hints are used for “normal” (async) replication between data 
centres, i.e. hints could show up without any nodes being down whatsoever. This 
could explain what I am seeing. Does anyone know more about this? Does that 
mean I will see hints even if I disable hinted handoff?
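For what it's worth, hinted handoff can be inspected and toggled at runtime with nodetool; a sketch of the relevant commands (run against a live node, so verify them against your Cassandra version):

```shell
# Check whether hinted handoff is currently enabled on this node
nodetool statushandoff

# Disable / re-enable hint storage at runtime, without a restart
nodetool disablehandoff
nodetool enablehandoff

# Discard hints that have already accumulated on this node
nodetool truncatehints
```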

Any pointers or help are greatly appreciated!

Thanks in advance
Jens

