Dropped Mutation Messages in two DCs at different sites

Benyi Wang Tue, 03 Jan 2017 13:36:05 -0800

I need to batch load a lot of data everyday into a keyspace across two DCs,
 one DC is at west coast and the other is at east coast.


I assume that the network delay between two DCs at different sites will
cause a lot of dropped mutation messages if I write too fast in LOCAL DC
using LOCAL_QUORUM.

I did this test: the test cluster has two DCs in one network at the same
site, but the configuration of the remote DC is lower than the local one.
When I used LOCAL_QUORUM and wrote fast enough, I observed a lot of dropped
mutation messages in the remote DC. So I guess the same thing will happen
if two DCs are at different sites.

To my understanding, the coordinator in the LOCAL dc will send write
requests to all copies including the remote copies, and return SUCCESS to
the client once the quorum of the copies in LOCAL dc respond. Due to the
network delay, the remote side will process the requests with a delay, and
new requests to the remote side arrive at the speed of LOCAL dc.
Eventually, the requests in the queue will exceed the timeout, and the
dropped mutation messages happen.

But I am not sure if my analysis is correct because the above analysis
doesn't consider that there are more connections than one DC situation and
if the network bandwidth slows down the process in LOCAL DC.

If my analysis is correct, the solution could be either slow down the batch
load speed, or configure remote side with longer timeout. My question is
how can I design some tests to find out how slow will be for the batch load
to avoid dropped mutation messages at the remote site.

If my analysis is wrong, could you explain what actually happens in this
situation?

Thanks.

Dropped Mutation Messages in two DCs at different sites

Reply via email to