Hi Steve,
Unfortunately, for BigQuery it's more complicated than that. Rows aren't
written to BigQuery one by one (unless you're using streaming inserts,
which are far more expensive and usually used only in streaming
pipelines) - they are written to files, and then a BigQuery import job,
or several import jobs if there are too many files, pick them up. We can
declare writing complete when all of the BigQuery import jobs have
successfully completed.
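
(For reference, each such import job is roughly equivalent to the
following, sketched here with the google-cloud-bigquery Java client; the
dataset, table and GCS path are placeholders, and the connector does the
equivalent internally:)

  // (classes from com.google.cloud.bigquery.*)
  BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
  LoadJobConfiguration config =
      LoadJobConfiguration.newBuilder(
              TableId.of("my_dataset", "my_table"),   // placeholder table
              "gs://my-bucket/temp-files/*.json")     // placeholder temp files
          .setFormatOptions(FormatOptions.json())
          .build();
  Job job = bigquery.create(JobInfo.of(config));
  job = job.waitFor();  // blocks until the import job succeeds or fails
  if (job.getStatus().getError() != null) {
    throw new RuntimeException(job.getStatus().getError().toString());
  }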
However, the method of writing is an implementation detail of BigQuery, so
we need to create an API that works regardless of the method (import jobs
vs. streaming inserts).
Another complication is triggering - windows can fire multiple times. This
rules out any approaches that sequence using side inputs, because side
inputs don't have triggering.

I think a common approach could be to return a PCollection<Void>,
containing a Void in every window and pane that has been successfully
written. This could be implemented for both write methods and could
become a general design pattern for this sort of thing. It just isn't
easy to implement, so I didn't have time to take it on. It also could
turn out to have other complications we haven't thought of yet.
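
To make that concrete, here is a purely hypothetical sketch of what using
such an API might look like - getDone() and UpdateStatusTableFn are
invented names, not anything that exists in Beam today:

  WriteResult result = rows.apply(BigQueryIO.writeTableRows().to(table));
  // Hypothetical: one Void per window+pane whose write has fully completed.
  PCollection<Void> done = result.getDone();
  // Anything applied here would only run after that window/pane is written.
  done.apply(ParDo.of(new UpdateStatusTableFn()));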

That said, if somebody tried to implement this for some connectors (not
necessarily BigQuery) and pioneered the approach, it would be a great
contribution.

On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <[email protected]> wrote:

> I wonder if it makes sense to start simple and go from there.  For example,
> I enhanced BigtableIO.Write to output the number of rows written
> in finishBundle(), simply into the global window with the current
> timestamp.  This was more than enough to unblock me, but doesn't support
> more complicated scenarios with windowing.
>
> However, as I said it was more than enough to solve the general batch use
> case, and I imagine it could be enhanced to support windowing by keeping
> track of which windows were written per bundle. (Can there even be more
> than one window per bundle?)
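>
> A rough sketch of that hack, in case it helps (names are approximate and
> the actual Bigtable write is elided):
>
>   class CountingWriteFn
>       extends DoFn<KV<ByteString, Iterable<Mutation>>, Long> {
>     private long rowsWritten;
>
>     @StartBundle
>     public void startBundle() {
>       rowsWritten = 0;
>     }
>
>     @ProcessElement
>     public void processElement(ProcessContext c) {
>       writeToBigtable(c.element());  // elided: the actual Bigtable write
>       rowsWritten++;
>     }
>
>     @FinishBundle
>     public void finishBundle(FinishBundleContext c) {
>       // Emit the per-bundle count into the global window, current timestamp.
>       c.output(rowsWritten, Instant.now(), GlobalWindow.INSTANCE);
>     }
>   }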
>
> On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <
> [email protected]> wrote:
>
> > Hi,
> > I was going to implement this, but discussed it with +Reuven Lax
> > <[email protected]> and it appears to be quite difficult to do properly,
> > or even to define what it means at all, especially if you're using the
> > streaming inserts write method. So for now there is no workaround except
> > programmatically waiting for your whole pipeline to finish
> > (pipeline.run().waitUntilFinish()).
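> >
> > Concretely (updateStatusTable here is a placeholder for whatever
> > follow-up action you need):
> >
> >   PipelineResult result = pipeline.run();
> >   result.waitUntilFinish();
> >   if (result.getState() == PipelineResult.State.DONE) {
> >     updateStatusTable();  // placeholder: your post-pipeline action
> >   }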
> >
> > On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <[email protected]> wrote:
> >
> > > Is there a way around this for now?
> > > How can I get a snapshot version?
> > >
> > > chaim
> > >
> > > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
> > > <[email protected]> wrote:
> > > > Oh I see! Okay, this should be easy to fix. I'll take a look.
> > > >
> > > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <[email protected]> wrote:
> > > >
> > > >> WriteResult does not support apply -> that is the problem
> > > >>
> > > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov
> > > >> <[email protected]> wrote:
> > > >> > Hi,
> > > >> >
> > > >> > Sorry for the delay. So it sounds like you want to do something
> > > >> > after writing a window of data to BigQuery is complete.
> > > >> > I think this should be possible: expansion of BigQueryIO.write()
> > > >> > returns a WriteResult and you can apply other transforms to it.
> > > >> > Have you tried that?
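> > > >> >
> > > >> > For instance, something like this (a sketch; getFailedInserts()
> > > >> > applies to the streaming-inserts path, and LogFailedRowFn is a
> > > >> > placeholder DoFn):
> > > >> >
> > > >> >   WriteResult result =
> > > >> >       rows.apply(BigQueryIO.writeTableRows().to(tableSpec));
> > > >> >   // Failed rows come back as a PCollection you can process further.
> > > >> >   result.getFailedInserts().apply(ParDo.of(new LogFailedRowFn()));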
> > > >> >
> > > >> >> On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <[email protected]>
> > > >> >> wrote:
> > > >> >
> > > >> >> I have documents in a MongoDB database that I need to migrate to
> > > >> >> BigQuery. Since it is MongoDB, I do not know the schema ahead of
> > > >> >> time, so I have two pipelines: one runs over the documents and
> > > >> >> updates the BigQuery schema, then waits a few minutes (it can
> > > >> >> take a while for BigQuery to be able to use the new schema), and
> > > >> >> then the other pipeline copies all the documents.
> > > >> >> To know where I got to with the different pipelines, I have a
> > > >> >> status table, so that at the start I know from where to continue.
> > > >> >> So I need the option to update the status table with the success
> > > >> >> of the copy and some time value of the last copied document.
> > > >> >>
> > > >> >>
> > > >> >> chaim
> > > >> >>
> > > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov
> > > >> >> <[email protected]> wrote:
> > > >> >> > I'd like to know more about both of your use cases, can you
> > > >> >> > clarify? I think making sinks output something that can be
> > > >> >> > waited on by another pipeline step is a reasonable request,
> > > >> >> > but more details would help refine this suggestion.
> > > >> >> >
> > > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath
> > > >> >> > <[email protected]> wrote:
> > > >> >> >
> > > >> >> >> Can you do this from the program that runs the Beam job, after
> > > >> >> >> the job is complete (you might have to use a blocking runner
> > > >> >> >> or poll for the status of the job)?
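> > > >> >> >>
> > > >> >> >> For example, by polling (the sleep interval is arbitrary):
> > > >> >> >>
> > > >> >> >>   PipelineResult result = pipeline.run();
> > > >> >> >>   while (!result.getState().isTerminal()) {
> > > >> >> >>     Thread.sleep(10_000);  // poll every 10 seconds
> > > >> >> >>   }
> > > >> >> >>   // now update the status table from the launcher program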
> > > >> >> >>
> > > >> >> >> - Cham
> > > >> >> >>
> > > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz
> > > >> >> >> <[email protected]> wrote:
> > > >> >> >>
> > > >> >> >> > I also have a similar use case (but with BigTable) that I
> > > >> >> >> > feel like I had to hack up to make work.  It'd be great to
> > > >> >> >> > hear if there is a way to do something like this already, or
> > > >> >> >> > if there are plans in the future.
> > > >> >> >> >
> > > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel
> > > >> >> >> > <[email protected]> wrote:
> > > >> >> >> >
> > > >> >> >> > > Hi,
> > > >> >> >> > >   I have a few pipelines that are an ETL from different
> > > >> >> >> > > systems to BigQuery.
> > > >> >> >> > > I would like to write the status of the ETL after all
> > > >> >> >> > > records have been updated in BigQuery.
> > > >> >> >> > > The problem is that writing to BigQuery is a sink, and you
> > > >> >> >> > > cannot have any other steps after the sink.
> > > >> >> >> > > I tried a side output, but this is called in no correlation
> > > >> >> >> > > to the writing to BigQuery, so I don't know if it succeeded
> > > >> >> >> > > or failed.
> > > >> >> >> > >
> > > >> >> >> > >
> > > >> >> >> > > any ideas?
> > > >> >> >> > > chaim
> > > >> >> >> > >