I need to batch load a large amount of data every day into a keyspace that spans two DCs, one on the west coast and the other on the east coast.
I suspect that the network delay between two DCs at different sites will cause many dropped mutation messages if I write too fast in the local DC using LOCAL_QUORUM. I ran this test: the test cluster has two DCs on one network at the same site, but the remote DC is less powerful (a lower configuration) than the local one. When I used LOCAL_QUORUM and wrote fast enough, I observed many dropped mutation messages in the remote DC. So I guess the same thing will happen if the two DCs are at different sites.

To my understanding, the coordinator in the local DC sends write requests to all replicas, including the remote ones, and returns success to the client once a quorum of the replicas in the local DC has responded. Because of the network delay, the remote side processes the requests late, while new requests keep arriving at the rate the local DC accepts them. Eventually, requests waiting in the remote queue exceed the timeout and are dropped.

But I am not sure my analysis is correct, because it does not consider that there are more connections than in a single-DC setup, or whether limited network bandwidth would also slow down processing in the local DC.

If my analysis is correct, the solution could be either to slow down the batch load or to configure the remote side with a longer timeout. My question is: how can I design tests to find out how slow the batch load must be to avoid dropped mutation messages at the remote site? If my analysis is wrong, could you explain what actually happens in this situation? Thanks.
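
To make my reasoning concrete, here is a minimal sketch of the queueing model I have in mind. It is not Cassandra code and does not talk to a cluster. All names (`simulate_remote_queue`, `max_safe_rate`) and all numbers are hypothetical, and it assumes a single FIFO worker on the remote side, a fixed per-mutation service time, and a timeout measured from arrival in the remote queue. In a real test I would replace the simulated drop count with the dropped MUTATION count reported by `nodetool tpstats` on the remote nodes.

```python
def simulate_remote_queue(arrival_rate, service_rate, timeout_s, duration_s):
    """Toy model: count mutations dropped by one FIFO remote worker.

    arrival_rate -- mutations/s forwarded by the local DC (assumed constant)
    service_rate -- mutations/s the remote replica can apply (assumed constant)
    timeout_s    -- stand-in for the write timeout on the remote side
    duration_s   -- how long the load runs
    Returns (dropped, total).
    """
    interarrival = 1.0 / arrival_rate
    service_time = 1.0 / service_rate
    total = int(arrival_rate * duration_s)

    free_at = 0.0  # time at which the worker finishes its current mutation
    dropped = 0
    for i in range(total):
        arrival = i * interarrival
        start = max(arrival, free_at)  # wait until the worker is free
        if start - arrival > timeout_s:
            # Waited in the queue longer than the timeout: dropped,
            # and the worker spends no time applying it.
            dropped += 1
        else:
            free_at = start + service_time
    return dropped, total


def max_safe_rate(candidate_rates, service_rate, timeout_s, duration_s):
    """Highest candidate rate with zero drops in the toy model, else None."""
    safe = [
        rate
        for rate in candidate_rates
        if simulate_remote_queue(rate, service_rate, timeout_s, duration_s)[0] == 0
    ]
    return max(safe) if safe else None


if __name__ == "__main__":
    # Hypothetical numbers: remote side applies 1000 mutations/s, 2 s timeout.
    for rate in (500, 800, 1200, 2000):
        dropped, total = simulate_remote_queue(rate, 1000, 2.0, 60)
        print(f"rate={rate}/s dropped={dropped} of {total}")
    print("max safe rate:", max_safe_rate([500, 800, 1200, 2000], 1000, 2.0, 60))
```

In this model a constant network delay only shifts every arrival by the same amount, so drops appear only when the sustained write rate exceeds what the remote side can apply. It also suggests that a test has to run long enough: a rate slightly above the remote capacity shows no drops in a short run, because the backlog needs time to grow past the timeout.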