How long does it take to rebuild? Even for thousands of tables I would not
expect it to take very long, unless you are hitting quota rate limits with
BigQuery. If that's the case, maybe a better solution is to see if those
quotas could be raised?
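The create-if-not-exists caching pattern under discussion, and how a shared external cache would survive a redeploy, can be sketched roughly as follows. This is a minimal illustration only: the BigQuery API and the Redis/Memcached store are both stubbed out with in-memory sets, and all names are hypothetical, not BigQueryIO's actual internals.

```python
# Sketch of BigQueryIO-style table-creation caching (all services stubbed).
existing_tables = set()    # stand-in for BigQuery's real table state
creation_calls = []        # records each expensive "create table" RPC

def create_table_if_needed(table, local_cache, shared_cache=None):
    """Create `table` unless a cache already says it exists."""
    if table in local_cache:
        return
    if shared_cache is not None and table in shared_cache:
        local_cache.add(table)          # warm the local cache from the shared one
        return
    creation_calls.append(table)        # one create/verify RPC per cache miss
    existing_tables.add(table)          # idempotent: fine if it already exists
    local_cache.add(table)
    if shared_cache is not None:
        shared_cache.add(table)

shared = set()                          # stand-in for Redis/Memcached

# First deployment: local cache starts empty, every table is a miss.
run1_cache = set()
for t in ("project.dataset.t1", "project.dataset.t2"):
    create_table_if_needed(t, run1_cache, shared)

# Redeploy: the local cache is empty again, but the shared cache is warm,
# so no further creation RPCs are issued.
run2_cache = set()
for t in ("project.dataset.t1", "project.dataset.t2"):
    create_table_if_needed(t, run2_cache, shared)

print(len(creation_calls))   # only the first run hit the (stubbed) API
```

Without the `shared_cache` argument, the second loop would repeat both RPCs, which is the rebuild cost being described for 10k tables.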

On Fri, Dec 4, 2020 at 9:57 AM Vasu Gupta <[email protected]> wrote:

> Hey Reuven, yes, you are correct that BigQueryIO is working as intended,
> but the issue is that since it's a local cache, it is rebuilt from scratch
> whenever the pipeline is redeployed, which is very time-consuming for
> thousands of tables.
>
> On 2020/12/03 17:58:04, Reuven Lax <[email protected]> wrote:
> > What exactly is the issue? If the cache is empty, then BigQueryIO will
> try
> > and create the table again, and the creation will fail since the table
> > exists. This is working as intended.
> >
> > The only reason for the cache is so that BigQueryIO doesn't continuously
> > hammer BigQuery with creation requests every second.
> >
> > On Wed, Dec 2, 2020 at 3:20 PM Vasu Gupta <[email protected]>
> wrote:
> >
> > > Hey folks,
> > >
> > > While using BigQueryIO to insert into 10k tables, I found an issue in
> > > its local caching technique for table creation. BigQueryIO first looks
> > > a table up in its local cache and then decides whether to create it.
> > > The main issue arises when inserting into thousands of tables: suppose
> > > we have 10k tables to insert into in real time, and we deploy a fresh
> > > Dataflow pipeline once a week. The local cache will be empty, and it
> > > will take a huge amount of time just to rebuild it for all 10k tables,
> > > even though those 10k tables were already created in BigQuery.
> > >
> > > The solution I could propose is to provide an option to use an
> > > external caching service like Redis or Memcached, so that we don't
> > > have to rebuild the cache after every fresh deployment of the
> > > pipeline.
> > >
> >
>