Hi Voytek,
I think the way you are using it is definitely the canonical way.
Unfortunately, as you learned, there are some gotchas. We tried
substantially increasing the batch size thresholds, and it worked for a
while, until we reached new scale, so we increased them again, and so
forth. It works, but before long you start getting write timeouts, lots of
them. And the thing about multi-partition batch statements is that they
offer atomicity, but not isolation. This means your database can
temporarily be in an inconsistent state while writes are propagating to
the various machines.
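
To make sure we're talking about the same thing, here is roughly the kind
of multi-partition logged batch I mean; a minimal sketch with the Python
driver, where the table and column names are made up for illustration:

    import uuid
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    session = Cluster(['127.0.0.1']).connect('my_ks')

    # Two denormalizations of the same event, keyed differently, so the
    # two inserts land on different partitions.
    by_user = session.prepare(
        "INSERT INTO events_by_user (user_id, event_id, payload) VALUES (?, ?, ?)")
    by_day = session.prepare(
        "INSERT INTO events_by_day (day, event_id, payload) VALUES (?, ?, ?)")

    event_id = uuid.uuid4()
    batch = BatchStatement(batch_type=BatchType.LOGGED)
    batch.add(by_user, ('voytek', event_id, 'hello'))     # one partition
    batch.add(by_day, ('2016-12-07', event_id, 'hello'))  # another partition

    # Atomic: both rows are (eventually) written, or neither is.
    # Not isolated: a reader can see one row before the other shows up.
    session.execute(batch)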

For our use case, we could deal with temporary inconsistency, as long as it
was for a strictly bounded period of time, on the order of a few seconds.
Unfortunately, as with all things eventually consistent, it degrades to
"totally inconsistent" when your database is under heavy load and the
time-bounds expand beyond what the application can handle. When a batch
write times out, it often still succeeds (eventually), but your tables can
be inconsistent for minutes, even while nodetool status shows all nodes up
and normal.

But there is another way, one that requires us to take a page from our
RDBMS ancestors' book: multi-phase commit.

Similar to logged batch writes, multi-phase commit patterns typically
entail some write amplification cost for the benefit of stronger
consistency guarantees across isolatable units (in Cassandra's case,
*partitions*). However, multi-phase commit offers stronger guarantees than
batch writes, and ALL of the additional write load is completely
distributed as per your load-balancing policy, whereas batch writes all go
through one coordinator node, then get written in their entirety to the
batch log on two or three nodes, and only then get dispersed in a
distributed fashion from there.

A typical two-phase commit pattern looks like this (rough Python sketches
follow each list below):

The Write Path

   1. The client code chooses a random UUID.
   2. The client writes the UUID into the IncompleteTransactions table,
   which only has one column, the transactionUUID.
   3. The client makes all of the inserts involved in the transaction, IN
   PARALLEL, with the transactionUUID duplicated in every inserted row.
   4. The client deletes the UUID from the IncompleteTransactions table.
   5. The client updates all of the rows it inserted, IN PARALLEL, setting
   the transactionUUID to null.

The Read Path

   1. The client reads some rows from a partition. If this particular
   client request can handle extraneous rows, you are done. If not, read on to
   step #2.
   2. The client gathers the set of distinct non-null transactionUUIDs from
   the rows it read. In the main case, they've all been nulled out by step
   #5 in the Write Path and the set is empty. If not, go to #3.
   3. For remaining transactionUUIDs (which should be a very small number),
   query the IncompleteTransactions table.
   4. The client code culls rows where the transactionUUID existed in the
   IncompleteTransactions table.
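
And the matching read path, continuing the same made-up schema:

    def read_events_for_user(user_id):
        rows = list(session.execute(               # 1. read a partition
            "SELECT * FROM events_by_user WHERE user_id = %s", (user_id,)))

        # 2. gather the transaction UUIDs still attached to rows
        pending = {r.transaction_uuid for r in rows
                   if r.transaction_uuid is not None}
        if not pending:
            return rows

        # 3. check which of those are still marked incomplete
        incomplete = set()
        for tx in pending:
            hit = session.execute(
                "SELECT transaction_uuid FROM incomplete_transactions"
                " WHERE transaction_uuid = %s", (tx,)).one()
            if hit is not None:
                incomplete.add(tx)

        # 4. cull rows that belong to incomplete transactions
        return [r for r in rows if r.transaction_uuid not in incomplete]

In the main case the "pending" set is empty and the extra queries never
happen.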

This is just an example, one that is reasonably performant for ledger-style
non-updated inserts. For transactions involving updates to possibly
existing data, more effort is required: generally the client needs to be
smart enough to merge updates based on a timestamp, with a periodic batch
job that cleans out obsolete inserts. If it feels like reinventing the
wheel, that's because it is. But it just might be the quickest path to what
you need.
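
To illustrate the kind of client-side merging I mean, here is one way it
could look, assuming (hypothetically) that each update lands as a new row
carrying an updated_at timestamp rather than overwriting in place:

    def merge_latest(rows):
        # Keep only the newest version of each logical row, by updated_at.
        latest = {}
        for r in rows:
            key = (r.user_id, r.event_id)
            if key not in latest or r.updated_at > latest[key].updated_at:
                latest[key] = r
        # A periodic batch job can later delete rows that lost this comparison.
        return list(latest.values())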

Thanks,
Cody

On Wed, Dec 7, 2016 at 10:15 AM, Edward Capriolo <edlinuxg...@gmail.com>
wrote:

> I have been circling around a thought process over batches. Now that
> Cassandra has aggregating functions, it might be possible to write a type
> of record that has an END_OF_BATCH type marker and the data can be
> suppressed from view until it is all there.
>
> IE you write something like a checksum record that an intelligent client
> can use to tell if the rest of the batch is complete.
>
> On Wed, Dec 7, 2016 at 11:58 AM, Voytek Jarnot <voytek.jar...@gmail.com>
> wrote:
>
>> Been about a month since I gave up on it, but it was very much related to
>> the stuff you're dealing with ... Basically Cassandra just stepping on its
>> own.... errrrr, tripping over its own feet streaming MVs.
>>
>> On Dec 7, 2016 10:45 AM, "Benjamin Roth" <benjamin.r...@jaumo.com> wrote:
>>
>>> I meant the mv thing
>>>
>>> On 07.12.2016 17:27, "Voytek Jarnot" <voytek.jar...@gmail.com> wrote:
>>>
>>>> Sure, about which part?
>>>>
>>>> default batch size warning is 5kb
>>>> I've increased it to 30kb, and will need to increase to 40kb (8x
>>>> default setting) to avoid WARN log messages about batch sizes.  I do
>>>> realize it's just a WARNing, but may as well avoid those if I can configure
>>>> it out.  That said, having to increase it so substantially (and we're only
>>>> dealing with 5 tables) is making me wonder if I'm not taking the correct
>>>> approach in terms of using batches to guarantee atomicity.
>>>>
>>>> On Wed, Dec 7, 2016 at 10:13 AM, Benjamin Roth <benjamin.r...@jaumo.com
>>>> > wrote:
>>>>
>>>>> Could you please be more specific?
>>>>>
>>>>> On 07.12.2016 17:10, "Voytek Jarnot" <voytek.jar...@gmail.com> wrote:
>>>>>
>>>>>> Should've mentioned - running 3.9.  Also - please do not recommend
>>>>>> MVs: I tried, they're broken, we punted.
>>>>>>
>>>>>> On Wed, Dec 7, 2016 at 10:06 AM, Voytek Jarnot <
>>>>>> voytek.jar...@gmail.com> wrote:
>>>>>>
>>>>>>> The low default value for batch_size_warn_threshold_in_kb is making
>>>>>>> me wonder if I'm perhaps approaching the problem of atomicity in a
>>>>>>> non-ideal fashion.
>>>>>>>
>>>>>>> With one data set duplicated/denormalized into 5 tables to support
>>>>>>> queries, we use batches to ensure inserts make it to all or 0 tables.  
>>>>>>> This
>>>>>>> works fine, but I've had to bump the warn threshold and fail threshold
>>>>>>> substantially (8x higher for the warn threshold).  This - in turn - 
>>>>>>> makes
>>>>>>> me wonder, with a default setting so low, if I'm not solving this 
>>>>>>> problem
>>>>>>> in the canonical/standard way.
>>>>>>>
>>>>>>> Mostly just looking for confirmation that we're not unintentionally
>>>>>>> doing something weird...
>>>>>>>
>>>>>>
>>>>>>
>>>>
>
