State of Dataflow pipelines is not maintained across different runs of a pipeline. Here as well, you could add a custom ParDo that persists such state in an external storage system and retrieves it when a fresh pipeline starts up.
Thanks,
Cham

On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta <dev.vasugu...@gmail.com> wrote:

> Hey folks,
>
> While using BigQueryIO for 10k-table insertion, I found an issue in its
> local caching technique for table creation. Tables are first searched in
> BigQueryIO's local cache, which then determines whether to create a table
> or not. The main issue arises when inserting into thousands of tables:
> suppose we have 10k tables to insert into in real time, and we deploy a
> fresh Dataflow pipeline once a week. The local cache will be empty, and it
> will take a huge amount of time just to rebuild that cache for 10k tables,
> even though those tables were already created in BigQuery.
>
> The solution I could propose for this is to provide an option for using
> external caching services like Redis/Memcached, so that we don't have to
> rebuild the cache again and again after a fresh deployment of the
> pipeline.
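For illustration, the external-cache idea might be sketched roughly as below. This is only a hypothetical sketch, not Beam's actual BigQueryIO implementation: `ExternalTableCache` stands in for a shared Redis/Memcached client, and `create_table` is a stub for the real BigQuery table-creation call. The point is that a cache living outside the pipeline survives redeployments, so a fresh pipeline skips creation calls for tables recorded by earlier runs.

```python
class ExternalTableCache:
    """Stand-in for a shared external cache (e.g. Redis) keyed by table spec.

    A real implementation would wrap a Redis/Memcached client; an in-memory
    dict is used here only to keep the sketch self-contained.
    """

    def __init__(self):
        self._store = {}

    def exists(self, table_spec):
        return self._store.get(table_spec, False)

    def mark_created(self, table_spec):
        self._store[table_spec] = True


def ensure_table(table_spec, cache, create_table):
    """Create the table only if the shared cache has no record of it.

    Returns True if a creation call was issued, False on a cache hit.
    Because the cache outlives any single pipeline deployment, a freshly
    deployed pipeline avoids repeating per-table existence checks for
    tables created by earlier runs.
    """
    if cache.exists(table_spec):
        return False  # cache hit: skip the creation call entirely
    create_table(table_spec)  # stub for the actual BigQuery API call
    cache.mark_created(table_spec)
    return True


# Usage: the second call for the same table spec is a cache hit, so only
# one creation call is issued even across "deployments" sharing the cache.
created = []
cache = ExternalTableCache()
ensure_table("project:dataset.table_1", cache, created.append)
ensure_table("project:dataset.table_1", cache, created.append)
print(created)  # → ['project:dataset.table_1']
```

In a real pipeline this check would sit inside a DoFn's processing path, with the cache client set up in the DoFn's setup method so each worker shares one connection.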