Re: BigQueryIO Partitions

Reuven Lax Wed, 27 Sep 2017 12:48:17 -0700

There are a couple of options, and if you provide a job id (since you are
using the Dataflow runner) we can better advise.


If you are writing to more than 1000 partitions, this won't work - BigQuery
has a hard quota of 1000 partition updates per table per day.
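
As a rough way to check a backfill against that limit, the sketch below counts the day partitions a date range would touch. The limit constant is just the 1000 figure quoted above (it may differ today), and the function names and the example date range are hypothetical.

```python
from datetime import date

# Per-table daily limit as quoted in this thread; an assumption, check current quotas.
PARTITION_UPDATES_PER_DAY = 1000


def daily_partitions(first_day, last_day):
    """Number of day partitions touched by data spanning first_day..last_day (inclusive)."""
    return (last_day - first_day).days + 1


def fits_in_one_day(first_day, last_day, limit=PARTITION_UPDATES_PER_DAY):
    """True if every day partition in the range could be updated within one day's quota."""
    return daily_partitions(first_day, last_day) <= limit


if __name__ == "__main__":
    # Hypothetical history range, for illustration only.
    start, end = date(2014, 1, 1), date(2017, 9, 27)
    print(daily_partitions(start, end), fits_in_one_day(start, end))
```

If the range does not fit, the backfill would have to be spread over several days or written to fewer partitions.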

If you have fewer than 1000 jobs, there are a few possibilities. It's
possible that BigQuery is taking a while to schedule some of those jobs;
they'll end up sitting in a queue waiting to be scheduled. We can look at
one of the jobs in detail to see if that's happening. Eugene's suggestion
of using your pipeline to load into a single table might be the best one.
You can write the date into a separate column, and then write a shell
script to copy each date to its own partition (see
https://cloud.google.com/bigquery/docs/creating-partitioned-tables#update-with-query
for some examples).
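
A minimal sketch of such a script, written in Python rather than shell so the date loop is easy to follow. It only builds the `bq query` command lines; it does not run them. The dataset, table, and column names are hypothetical, and the date column is assumed to be of type DATE.

```python
from datetime import date, timedelta
import shlex


def partition_copy_commands(dataset, staging_table, target_table,
                            date_column, first_day, last_day):
    """Yield one `bq query` command per day, each writing that day's rows
    from the staging table into the matching partition of the target table."""
    day = first_day
    while day <= last_day:
        # Partition decorator: table$YYYYMMDD addresses a single day partition.
        destination = f"{dataset}.{target_table}${day.strftime('%Y%m%d')}"
        # Assumes date_column is a DATE column (hypothetical schema).
        sql = (f"SELECT * FROM `{dataset}.{staging_table}` "
               f"WHERE {date_column} = DATE '{day.isoformat()}'")
        yield " ".join([
            "bq", "query", "--use_legacy_sql=false", "--replace",
            "--destination_table", shlex.quote(destination),
            shlex.quote(sql),
        ])
        day += timedelta(days=1)


if __name__ == "__main__":
    for cmd in partition_copy_commands("mydataset", "staging", "events",
                                       "event_date",
                                       date(2017, 1, 1), date(2017, 1, 3)):
        print(cmd)
```

Each generated command is a separate query job and a separate partition update, so the per-day limits mentioned earlier still apply to how many of them can be run in one day.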

On Wed, Sep 27, 2017 at 11:39 AM, Eugene Kirpichov <
[email protected]> wrote:

> I see. Then Reuven's answer above applies.
> Maybe you could write to a non-partitioned table, and then split it into
> smaller partitioned tables. See
> https://stackoverflow.com/a/39001706/278042 for a discussion of the
> current options - granted, it seems like there currently don't exist very
> good options for creating a very large number of table partitions from
> existing data.
>
> On Wed, Sep 27, 2017 at 4:01 AM Chaim Turkel <[email protected]> wrote:
>
> > Thank you for your detailed response.
> > Currently I am a bit stuck.
> > I need to migrate data from MongoDB to BigQuery; we have about 1 terabyte
> > of data. It is historical data, so I want to use BigQuery partitions.
> > It seems that the IO connector creates a job per partition, so it takes
> > a very long time, and I hit the BigQuery quota on the number of
> > jobs per day.
> > I would like to use streaming, but you cannot stream data older than 30
> > days.
> >
> > So I thought of partitions to see if I can get more parallelism.
> >
> > chaim
> >
> >
> > On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov
> > <[email protected]> wrote:
> > > Okay, I see - there's about 3 different meanings of the word "partition"
> > > that could have been involved here (BigQuery partitions, runner-specific
> > > bundles, and the Partition transform), hence my request for clarification.
> > >
> > > If you mean the Partition transform - then I'm confused about what you
> > > mean by BigQueryIO "supporting" it. The Partition transform takes a
> > > PCollection and produces a bunch of PCollections; these are ordinary
> > > PCollections and you can apply any Beam transforms to them, and
> > > BigQueryIO.write() is no exception to this - you can apply it too.
> > >
> > > To answer whether using Partition would improve your performance, I'd
> > > need to understand exactly what you're comparing against what. I suppose
> > > you're comparing the following:
> > > 1) Applying BigQueryIO.write() to a PCollection, writing to a single
> > > table
> > > 2) Splitting a PCollection into several smaller PCollections using
> > > Partition, and applying BigQueryIO.write() to each of them, writing to
> > > different tables I suppose? (or do you want to write to different
> > > BigQuery partitions of the same table using a table partition decorator?)
> > > I would expect #2 to perform strictly worse than #1, because it writes
> > > the same amount of data but increases the number of BigQuery load jobs
> > > involved (thus increasing per-job overhead and consuming more BigQuery
> > > quota).
> > >
> > > On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
> > >
> > >> https://beam.apache.org/documentation/programming-guide/#partition
> > >>
> > >> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
> > >> <[email protected]> wrote:
> > >> > What do you mean by Beam partitions?
> > >> >
> > >> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
> > >> >
> > >> >> By the way, the performance on BigQuery partitions is currently very
> > >> >> bad. Is there a repository where I can test with 2.2.0?
> > >> >>
> > >> >> chaim
> > >> >>
> > >> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]> wrote:
> > >> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if
> > >> >> > the table containing the partitions is not pre-created (fixed in
> > >> >> > 2.2.0).
> > >> >> >
> > >> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]> wrote:
> > >> >> >
> > >> >> >> Hi,
> > >> >> >>
> > >> >> >>    Does BigQueryIO support Partitions when writing? Will it
> > >> >> >> improve my performance?
> > >> >> >>
> > >> >> >>
> > >> >> >> chaim
> > >> >> >>
> > >> >>
> > >>
> >
>
