Sounds good! Regarding the second paragraph: yes, there was a recent change to the number of mutations in a batch. I would still recommend using bulkOptions via withBigtableOptionsConfigurator(); I believe the field BIGTABLE_BULK_MAX_ROW_KEY_COUNT_DEFAULT may be what you are looking for. I can't really advise a specific number since each use case is different, but experiment and determine which size best suits your workload.
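For illustration, overriding the row-key batch count through withBigtableOptionsConfigurator() might look roughly like the sketch below. The project/instance/table IDs and the count of 500 are placeholder assumptions, not recommendations; this is a sketch of the configuration shape, not something tested against a live cluster:

```java
import com.google.cloud.bigtable.config.BulkOptions;
import org.apache.beam.sdk.io.gcp.bigtable.BigtableIO;

// Sketch: raise the number of row keys batched per bulk request.
// IDs and the value 500 are placeholders -- tune per workload.
BigtableIO.Write write =
    BigtableIO.write()
        .withProjectId("my-project")      // placeholder
        .withInstanceId("my-instance")    // placeholder
        .withTableId("my-table")          // placeholder
        .withBigtableOptionsConfigurator(
            builder ->
                builder.setBulkOptions(
                    BulkOptions.builder()
                        .setBulkMaxRowKeyCount(500) // assumed value
                        .build()));
```

Larger values mean fewer, bigger write requests; the right number depends on mutation size and latency tolerance.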
More information on the different bulkOptions is available here:
https://cloud.google.com/bigtable/docs/hbase-client/javadoc/com/google/cloud/bigtable/config/BulkOptions.html

-Diego

On Tue, Aug 16, 2022 at 1:47 PM Sahith Nallapareddy <[email protected]> wrote:

> Hello Diego,
>
> Right now we are using BigtableIO, so I will continue to use that one!
>
> For the second part, I'll explain a bit more of what we saw, as I simplified
> a bit in my original email. At some point we had two streaming pipelines
> writing to Bigtable, and we decided to combine these into one pipeline that
> writes to multiple Bigtable tables. What we found is that our network
> traffic to Bigtable went up by a bit more than 3x compared to when the
> pipelines were separate. Our node count was about the same; looking back, I
> think I misremembered that part. We opened a Google ticket at the time to
> see what we could do to remedy this, as we didn't expect that much of a
> cost increase, and they told us it was due to the new implementation
> batching fewer mutations per request (causing more write requests) than the
> old one. We were advised to experiment with the bulk options, but we have
> not had a chance to yet, so I will try that at some point. I was wondering
> if anyone could shed light on whether that is the best way to configure how
> much Bigtable batches requests, or whether there is more that could be done.
>
> Thanks,
>
> Sahith
>
> On Tue, Aug 16, 2022 at 1:04 PM Diego Gomez <[email protected]> wrote:
>
>> Hello Sahith,
>>
>> We recommend using BigtableIO over CloudBigtableIO. The two have similar
>> performance; the main difference is that CloudBigtableIO uses HBase
>> Results and Puts, while BigtableIO uses protos to read results and write
>> mutations.
>>
>> The two connectors should result in similar spending on Bigtable's side;
>> more write requests doesn't necessarily mean more cost/nodes. What version
>> of CloudBigtableIO are you using, and are you using an autoscaling CBT
>> cluster?
>>
>> -Diego
>>
>> On Tue, Aug 16, 2022 at 11:55 AM Sahith Nallapareddy via dev <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> I see that there are two implementations of reading from and writing to
>>> Bigtable: one in Beam, and one that is referenced in the Google Cloud
>>> documentation. Is one preferred over the other? We often use the Beam
>>> BigtableIO to write to Bigtable, but I have found that the default
>>> configuration can sometimes lead to a lot of write requests (which also
>>> seems to lead to more nodes, and therefore more cost). I am about to try
>>> adjusting the bulk options to see if that can raise the batching of
>>> mutations, but is there anything else I should try, like switching the
>>> actual transform we use?
>>>
>>> Thanks,
>>>
>>> Sahith
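[Editor's note: as background on the proto-based mutation format mentioned in the quoted reply above, a minimal sketch of what a BigtableIO write element looks like follows. The row key, column family, qualifier, and value are hypothetical; this illustrates the element shape BigtableIO.write() consumes, in contrast to the HBase Puts used by CloudBigtableIO.]

```java
import com.google.bigtable.v2.Mutation;
import com.google.protobuf.ByteString;
import java.util.Collections;
import org.apache.beam.sdk.values.KV;

// BigtableIO.write() consumes KV<ByteString, Iterable<Mutation>> elements,
// built from Bigtable v2 protos rather than HBase Puts.
Mutation mutation =
    Mutation.newBuilder()
        .setSetCell(
            Mutation.SetCell.newBuilder()
                .setFamilyName("cf")                                 // hypothetical family
                .setColumnQualifier(ByteString.copyFromUtf8("col"))  // hypothetical qualifier
                .setValue(ByteString.copyFromUtf8("value"))
                .setTimestampMicros(System.currentTimeMillis() * 1000))
        .build();

KV<ByteString, Iterable<Mutation>> element =
    KV.of(ByteString.copyFromUtf8("row-key"),
          Collections.singletonList(mutation));
```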
