Thank you for your detailed response. Currently I am a bit stuck. I need to migrate data from MongoDB to BigQuery; we have about 1 TB of data. It is historical data, so I want to use BigQuery partitions. It seems that the IO connector creates a job per partition, so it takes a very long time, and I hit the BigQuery quota on the number of jobs per day. I would like to use streaming, but you cannot stream data that is more than 30 days old.
So I thought of using partitions to see if I can get more parallelism.

chaim

On Wed, Sep 27, 2017 at 9:49 AM, Eugene Kirpichov <[email protected]> wrote:
> Okay, I see - there's about 3 different meanings of the word "partition"
> that could have been involved here (BigQuery partitions, runner-specific
> bundles, and the Partition transform), hence my request for clarification.
>
> If you mean the Partition transform - then I'm confused what do you mean by
> BigQueryIO "supporting" it? The Partition transform takes a PCollection and
> produces a bunch of PCollections; these are ordinary PCollection's and you
> can apply any Beam transforms to them, and BigQueryIO.write() is no
> exception to this - you can apply it too.
>
> To answer whether using Partition would improve your performance, I'd need
> to understand exactly what you're comparing against what. I suppose you're
> comparing the following:
> 1) Applying BigQueryIO.write() to a PCollection, writing to a single table
> 2) Splitting a PCollection into several smaller PCollection's using
> Partition, and applying BigQueryIO.write() to each of them, writing to
> different tables I suppose? (or do you want to write to different BigQuery
> partitions of the same table using a table partition decorator?)
> I would expect #2 to perform strictly worse than #1, because it writes the
> same amount of data but increases the number of BigQuery load jobs involved
> (thus increases per-job overhead and consumes BigQuery quota).
>
> On Tue, Sep 26, 2017 at 11:35 PM Chaim Turkel <[email protected]> wrote:
>
>> https://beam.apache.org/documentation/programming-guide/#partition
>>
>> On Tue, Sep 26, 2017 at 6:42 PM, Eugene Kirpichov
>> <[email protected]> wrote:
>> > What do you mean by Beam partitions?
>> >
>> > On Tue, Sep 26, 2017, 6:57 AM Chaim Turkel <[email protected]> wrote:
>> >
>> >> by the way currently the performance on bigquery partitions is very bad.
>> >> Is there a repository where i can test with 2.2.0?
>> >>
>> >> chaim
>> >>
>> >> On Tue, Sep 26, 2017 at 4:52 PM, Reuven Lax <[email protected]> wrote:
>> >> > Do you mean BigQuery partitions? Yes, however 2.1.0 has a bug if the
>> >> > table containing the partitions is not pre created (fixed in 2.2.0).
>> >> >
>> >> > On Tue, Sep 26, 2017 at 6:40 AM, Chaim Turkel <[email protected]> wrote:
>> >> >
>> >> >> Hi,
>> >> >>
>> >> >> Does BigQueryIO support Partitions when writing? will it improve my
>> >> >> performance?
>> >> >>
>> >> >>
>> >> >> chaim
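
To make the job-count concern in this thread concrete, here is a rough sketch in plain Python (standard library only, not Beam code). It maps row timestamps to BigQuery day-partition decorators of the form table$YYYYMMDD and counts how many distinct partitions a backfill would touch. If a writer issues one load job per partition, that count is roughly the number of jobs. The table name "history", the sample timestamps, and the jobs_per_day value of 1000 are all hypothetical placeholders; check the actual quota for your project before relying on any number.

```python
from collections import Counter
from datetime import datetime, timezone


def partition_decorator(table: str, ts: datetime) -> str:
    """Return 'table$YYYYMMDD' for the UTC day that contains ts."""
    day = ts.astimezone(timezone.utc).strftime("%Y%m%d")
    return f"{table}${day}"


def rows_per_partition(table: str, timestamps) -> Counter:
    """Count how many rows fall into each day partition."""
    return Counter(partition_decorator(table, ts) for ts in timestamps)


def fits_daily_quota(partition_count: int, jobs_per_day: int) -> bool:
    """True if one load job per partition stays within the assumed quota."""
    return partition_count <= jobs_per_day


# Hypothetical sample: three rows spread over two days.
sample = [
    datetime(2017, 1, 1, 8, 0, tzinfo=timezone.utc),
    datetime(2017, 1, 1, 23, 59, tzinfo=timezone.utc),
    datetime(2017, 1, 2, 0, 0, tzinfo=timezone.utc),
]
counts = rows_per_partition("history", sample)
print(counts)

# jobs_per_day=1000 is an assumed placeholder, not a documented limit.
print(fits_daily_quota(len(counts), jobs_per_day=1000))
```

Running the same count over the real timestamp range gives a quick estimate of whether a one-job-per-partition backfill could finish in a single day under a given quota, or whether the load needs to be spread across several days.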
