State of Dataflow pipelines is not maintained across different runs of a pipeline. Here as well, you could add a custom ParDo that persists such state in an external storage system and retrieves it when a fresh pipeline starts up.
Thanks,
Cham

On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta <dev.vasugu...@gmail.com> wrote:

> Hey folks,
>
> While using BigQueryIO for 10k-table insertion, I found an issue in its
> local caching technique for table creation. Tables are first searched in
> BigQueryIO's local cache, which then determines whether to create a table
> or not. The main issue arises when inserting into thousands of tables:
> suppose we have 10k tables to insert into in real time, and we deploy a
> fresh Dataflow pipeline once a week. The local cache will be empty, and it
> will take a huge amount of time just to rebuild that cache for 10k tables,
> even though those tables were already created in BigQuery.
>
> The solution I could propose for this is to provide an option for using
> external caching services like Redis/Memcached, so that we don't have to
> rebuild the cache again and again after a fresh deployment of the
> pipeline.
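For illustration, the external-cache idea might be sketched roughly as below. This is only a hypothetical sketch, not Beam's actual BigQueryIO implementation: `ExternalTableCache` stands in for a shared Redis/Memcached client, and `create_table` is a stub for the real BigQuery table-creation call. The point is that a cache living outside the pipeline survives redeployments, so a fresh pipeline skips creation calls for tables recorded by earlier runs.

```python
class ExternalTableCache:
    """Stand-in for a shared external cache (e.g. Redis) keyed by table spec.

    A real implementation would wrap a Redis/Memcached client; an in-memory
    dict is used here only to keep the sketch self-contained.
    """

    def __init__(self):
        self._store = {}

    def exists(self, table_spec):
        return self._store.get(table_spec, False)

    def mark_created(self, table_spec):
        self._store[table_spec] = True


def ensure_table(table_spec, cache, create_table):
    """Create the table only if the shared cache has no record of it.

    Returns True if a creation call was issued, False on a cache hit.
    Because the cache outlives any single pipeline deployment, a freshly
    deployed pipeline avoids repeating per-table existence checks for
    tables created by earlier runs.
    """
    if cache.exists(table_spec):
        return False  # cache hit: skip the creation call entirely
    create_table(table_spec)  # stub for the actual BigQuery API call
    cache.mark_created(table_spec)
    return True


# Usage: the second call for the same table spec is a cache hit, so only
# one creation call is issued even across "deployments" sharing the cache.
created = []
cache = ExternalTableCache()
ensure_table("project:dataset.table_1", cache, created.append)
ensure_table("project:dataset.table_1", cache, created.append)
print(created)  # → ['project:dataset.table_1']
```

In a real pipeline this check would sit inside a DoFn's processing path, with the cache client set up in the DoFn's setup method so each worker shares one connection.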