Hey Reuven, yes, you are correct that BigQueryIO is working as intended. The issue is that, because the cache is local, it has to be rebuilt from scratch every time the pipeline is redeployed, which is very time-consuming when there are thousands of tables.
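
To make the idea concrete, here is a rough sketch of what I mean by an external cache. This is not BigQueryIO's actual code; the class name, Redis key, and host are placeholders, and it uses the standalone google-cloud-bigquery client plus Jedis purely to illustrate the shape of the check:

// Hypothetical sketch only: back the "table already created" check with an
// external cache (Redis via Jedis here) instead of a worker-local set.
// Class, key, and host names are made up for illustration.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Schema;
import com.google.cloud.bigquery.StandardTableDefinition;
import com.google.cloud.bigquery.TableId;
import com.google.cloud.bigquery.TableInfo;
import redis.clients.jedis.Jedis;

public class ExternalTableCreationCache {
  // Assumed key name for the shared set of already-created tables.
  private static final String CREATED_TABLES_KEY = "beam:created-bq-tables";

  private final BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  private final Jedis redis;

  public ExternalTableCreationCache(String redisHost, int redisPort) {
    this.redis = new Jedis(redisHost, redisPort);
  }

  /** Create the table only if the shared cache does not already record it. */
  public void ensureTable(TableId tableId, Schema schema) {
    String key = tableId.getDataset() + "." + tableId.getTable();

    // This check survives pipeline redeployments, unlike a worker-local set.
    if (redis.sismember(CREATED_TABLES_KEY, key)) {
      return;
    }

    // Fall back to asking BigQuery, then create the table if it is missing.
    if (bigquery.getTable(tableId) == null) {
      bigquery.create(TableInfo.of(tableId, StandardTableDefinition.of(schema)));
    }
    redis.sadd(CREATED_TABLES_KEY, key);
  }
}

The point is only that the "already created" set lives outside the workers, so a fresh deployment starts with it already populated; whether that is Redis, Memcached, or something else is an open question.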
On 2020/12/03 17:58:04, Reuven Lax <[email protected]> wrote:
> What exactly is the issue? If the cache is empty, then BigQueryIO will try
> and create the table again, and the creation will fail since the table
> exists. This is working as intended.
>
> The only reason for the cache is so that BigQueryIO doesn't continuously
> hammer BigQuery with creation requests every second.
>
> On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta <[email protected]> wrote:
>
> > Hey folks,
> >
> > While using BigQueryIO to insert into 10k tables, I found an issue in its
> > local caching technique for table creation. Tables are first looked up in
> > BigQueryIO's local cache, which determines whether a table needs to be
> > created. The main issue shows up when inserting into thousands of tables:
> > suppose we have 10k tables to insert into in real time, and we deploy a
> > fresh Dataflow pipeline once a week. The local cache will be empty, and
> > it will take a huge amount of time just to rebuild that cache for 10k
> > tables, even though those 10k tables were already created in BigQuery.
> >
> > The solution I would propose is to provide an option to use an external
> > caching service such as Redis or Memcached, so that we don't have to
> > rebuild the cache again and again after a fresh deployment of the
> > pipeline.
