No, windowing is supported by BigQueryIO.

The BigQueryIO transform is divided into two parts. The first part takes
your input elements and decides which tables they should be written to.
This can be based on the windows the elements are in (as well as on the
data in the elements themselves, if you want).
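
For example, with the to(SerializableFunction) overload you can route each
element to a table based on its window. A rough, untested sketch (the
project/dataset/table names and the schema variable are placeholders, and
it assumes the pipeline uses fixed windows):

    // Assumed imports: org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO,
    // TableDestination}, org.apache.beam.sdk.values.ValueInSingleWindow,
    // org.apache.beam.sdk.transforms.windowing.IntervalWindow,
    // org.joda.time.format.DateTimeFormat.
    // rows is a PCollection<TableRow>; schema is a TableSchema.
    rows.apply(BigQueryIO.writeTableRows()
        .to(new SerializableFunction<ValueInSingleWindow<TableRow>, TableDestination>() {
          @Override
          public TableDestination apply(ValueInSingleWindow<TableRow> value) {
            // Derive a daily partition from the start of the element's
            // window (the cast is safe only under fixed windows).
            IntervalWindow window = (IntervalWindow) value.getWindow();
            String day = DateTimeFormat.forPattern("yyyyMMdd").print(window.start());
            return new TableDestination("my-project:my_dataset.events$" + day, null);
          }
        })
        .withSchema(schema));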

The second part of the transform actually writes the data to BigQuery.
Since we've already used the windowing information to decide which tables
to write the elements to, this second part does not need the windowing
information any more. It is written in terms of Beam primitives (e.g.
GroupByKey) that we don't want affected by windowing; at this point we just
want to write the data to the target BigQuery tables as fast as possible.
So this second half of the transform rewindows all the data into
GlobalWindows before writing to BigQuery.
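
Conceptually the rewindow step is nothing more than this (a simplified
sketch of the idea, not the actual BigQueryIO internals; tagged,
DestinationT, and ElementT are placeholder names):

    // Assumed imports: org.apache.beam.sdk.transforms.windowing.{Window,
    // GlobalWindows}.
    // Discard the user's windowing so that the GroupByKey-based write
    // that follows is not affected by it.
    PCollection<KV<DestinationT, ElementT>> rewindowed = tagged.apply(
        "RewindowIntoGlobal",
        Window.<KV<DestinationT, ElementT>>into(new GlobalWindows()));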

As for getFailedInserts, it is only supported for the streaming path. The
reason is that it does not make sense for the batch path. The batch path
works by writing all elements to files in GCS and then telling BigQuery to
import all of those files. If there's a failure, we don't get any
information about which row failed; all we know is that the entire import
job failed. In streaming, on the other hand, we insert the rows one by one,
and when there's a failure we get detailed information on which row failed
and can return it in getFailedInserts.
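
So in streaming you can consume the failed rows directly from the
WriteResult. A minimal sketch (the table name is a placeholder, and LOG is
assumed to be an SLF4J Logger defined elsewhere):

    // Assumed imports: org.apache.beam.sdk.io.gcp.bigquery.{BigQueryIO,
    // WriteResult}, org.apache.beam.sdk.transforms.{ParDo, DoFn}.
    WriteResult result = rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.events")
            .withSchema(schema)
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS));

    // getFailedInserts() is a PCollection<TableRow> containing the rows
    // that BigQuery rejected; log them, or route them to a dead-letter
    // table.
    result.getFailedInserts().apply("LogFailedInserts",
        ParDo.of(new DoFn<TableRow, Void>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            LOG.error("Failed to insert row: {}", c.element());
          }
        }));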

Reuven

On Sat, Sep 9, 2017 at 10:25 PM, Chaim Turkel <[email protected]> wrote:

> So what you are saying is that windowing is not supported by
> BigQuery? This does not make sense, since I am using it for the table
> partition, and that works fine?
>
> On Sat, Sep 9, 2017 at 7:40 PM, Steve Niemitz <[email protected]> wrote:
> > I wonder if it makes sense to start simple and go from there.  For
> > example, I enhanced BigtableIO.Write to output the number of rows written
> > in finishBundle(), simply into the global window with the current
> > timestamp.  This was more than enough to unblock me, but doesn't support
> > more complicated scenarios with windowing.
> >
> > However, as I said it was more than enough to solve the general batch use
> > case, and I imagine it could be enhanced to support windowing by keeping
> > track of which windows were written per bundle. (Can there ever be more
> > than one window per bundle?)
> >
> > On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <[email protected]>
> > wrote:
> >
> >> Hi,
> >> I was going to implement this, but discussed it with +Reuven Lax
> >> <[email protected]> and it appears to be quite difficult to do properly,
> >> or even to define what it means at all, especially if you're using the
> >> streaming inserts write method. So for now there is no workaround except
> >> programmatically waiting for your whole pipeline to finish
> >> (pipeline.run().waitUntilFinish()).
> >>
> >> On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <[email protected]> wrote:
> >>
> >> > Is there a way around this for now?
> >> > How can I get a snapshot version?
> >> >
> >> > chaim
> >> >
> >> > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> >> > <[email protected]> wrote:
> >> > > Oh I see! Okay, this should be easy to fix. I'll take a look.
> >> > >
> >> > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <[email protected]>
> >> > > wrote:
> >> > >
> >> > >> WriteResult does not support apply -> that is the problem
> >> > >>
> >> > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> >> > >> <[email protected]> wrote:
> >> > >> > Hi,
> >> > >> >
> >> > >> > Sorry for the delay. So it sounds like you want to do something
> >> > >> > after writing a window of data to BigQuery is complete.
> >> > >> > I think this should be possible: the expansion of BigQueryIO.write()
> >> > >> > returns a WriteResult and you can apply other transforms to it.
> >> > >> > Have you tried that?
> >> > >> >
> >> > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <[email protected]>
> >> > >> > wrote:
> >> > >> >
> >> > >> >> I have documents from a MongoDB that I need to migrate to
> >> > >> >> BigQuery. Since it is MongoDB, I do not know the schema ahead
> >> > >> >> of time, so I have two pipelines: one to run over the documents
> >> > >> >> and update the BigQuery schema, then wait a few minutes (it can
> >> > >> >> take time for BigQuery to be able to use the new schema), and
> >> > >> >> then with the other pipeline copy all the documents.
> >> > >> >> To know where I got to with the different pipelines, I have a
> >> > >> >> status table so that at the start I know from where to continue.
> >> > >> >> So I need the option to update the status table with the success
> >> > >> >> of the copy and some time value of the last copied document.
> >> > >> >>
> >> > >> >>
> >> > >> >> chaim
> >> > >> >>
> >> > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> >> > >> >> <[email protected]> wrote:
> >> > >> >> > I'd like to know more about both of your use cases, can you
> >> > >> >> > clarify? I think making sinks output something that can be
> >> > >> >> > waited on by another pipeline step is a reasonable request,
> >> > >> >> > but more details would help refine this suggestion.
> >> > >> >> >
> >> > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath
> >> > >> >> > <[email protected]> wrote:
> >> > >> >> >
> >> > >> >> >> Can you do this from the program that runs the Beam job,
> >> > >> >> >> after the job is complete (you might have to use a blocking
> >> > >> >> >> runner or poll for the status of the job)?
> >> > >> >> >>
> >> > >> >> >> - Cham
> >> > >> >> >>
> >> > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz
> >> > >> >> >> <[email protected]> wrote:
> >> > >> >> >>
> >> > >> >> >> > I also have a similar use case (but with BigTable) that I
> >> > >> >> >> > feel like I had to hack up to make work.  It'd be great to
> >> > >> >> >> > hear if there is a way to do something like this already,
> >> > >> >> >> > or if there are plans in the future.
> >> > >> >> >> >
> >> > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel
> >> > >> >> >> > <[email protected]> wrote:
> >> > >> >> >> >
> >> > >> >> >> > > Hi,
> >> > >> >> >> > >   I have a few pipelines that are an ETL from different
> >> > >> >> >> > > systems to BigQuery.
> >> > >> >> >> > > I would like to write the status of the ETL after all
> >> > >> >> >> > > records have been updated in BigQuery.
> >> > >> >> >> > > The problem is that writing to BigQuery is a sink, and
> >> > >> >> >> > > you cannot have any other steps after the sink.
> >> > >> >> >> > > I tried a side output, but this is called with no
> >> > >> >> >> > > correlation to the writing to BigQuery, so I don't know
> >> > >> >> >> > > if it succeeded or failed.
> >> > >> >> >> > >
> >> > >> >> >> > >
> >> > >> >> >> > > Any ideas?
> >> > >> >> >> > > chaim
> >> > >> >> >> > >
> >> > >> >> >> >
> >> > >> >> >>
> >> > >> >>
> >> > >>
> >> >
> >>
>
