So how does the getFailedInserts method work? (Though from what I saw, it
does not work.)

chaim
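
For reference, a minimal sketch of how getFailedInserts is typically wired
up in the Beam Java SDK; `rows` is an assumed PCollection<TableRow>, and the
table name and downstream handling are illustrative. Note it only applies to
the streaming-inserts path, so if your pipeline uses file loads (the default
for bounded input) there is nothing for it to report, which may be why it
appears not to work:

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.io.gcp.bigquery.InsertRetryPolicy;
    import org.apache.beam.sdk.io.gcp.bigquery.WriteResult;
    import org.apache.beam.sdk.values.PCollection;

    WriteResult result = rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")  // illustrative table
            .withMethod(BigQueryIO.Write.Method.STREAMING_INSERTS)
            .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

    // Rows that permanently failed to insert come back here; route them to a
    // dead-letter destination or log them.
    PCollection<TableRow> failed = result.getFailedInserts();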

On Sat, Sep 9, 2017 at 9:49 PM, Reuven Lax <[email protected]> wrote:
> I'm still not sure how this would work (or even make sense) for the
> streaming-write path.
>
> Also in both paths, the actual write to BigQuery is unwindowed.
>
> On Sat, Sep 9, 2017 at 11:44 AM, Eugene Kirpichov <[email protected]>
> wrote:
>
>> There'd be 1 Void per pane per window, so I could extract information
>> about whether this is the first pane, last pane, or something else - there
>> are probably use cases for each of these.
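
For illustration, a sketch of what a downstream consumer of such a signal
could do with the pane information; `doneSignal` stands in for the
hypothetical PCollection<Void>, while PaneInfo itself is existing Beam API:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.transforms.windowing.PaneInfo;

    doneSignal.apply(ParDo.of(new DoFn<Void, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        PaneInfo pane = c.pane();
        if (pane.isFirst()) {
          // first firing for this window, e.g. create the status row
        }
        if (pane.isLast()) {
          // final firing for this window, e.g. mark it complete
        }
      }
    }));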
>>
>> On Sat, Sep 9, 2017 at 11:37 AM Reuven Lax <[email protected]> wrote:
>>
>>> How would you know how many Voids to wait for downstream?
>>>
>>> On Sat, Sep 9, 2017 at 10:46 AM, Eugene Kirpichov <[email protected]>
>>> wrote:
>>>
>>>> Hi Steve,
>>>> Unfortunately for BigQuery it's more complicated than that. Rows aren't
>>>> written to BigQuery one by one (unless you're using streaming inserts,
>>>> which are way more expensive and are usually used only in streaming
>>>> pipelines) - they are written to files, and then a BigQuery import job, or
>>>> several import jobs if there are too many files, picks them up. We can
>>>> declare writing complete when all of the BigQuery import jobs have
>>>> successfully completed.
>>>> However, the method of writing is an implementation detail of BigQuery,
>>>> so we need to create an API that works regardless of the method (import
>>>> jobs vs. streaming inserts).
>>>> Another complication is triggering - windows can fire multiple times.
>>>> This rules out any approaches that sequence using side inputs, because side
>>>> inputs don't have triggering.
>>>>
>>>> I think a common approach could be to return a PCollection<Void>,
>>>> containing a Void in every window and pane that has been successfully
>>>> written. This could be implemented in both modes and could be a general
>>>> design pattern for this sort of thing. It just isn't easy to implement, so
>>>> I didn't have time to take it on. It also could turn out to have other
>>>> complications we haven't thought of yet.
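
As a sketch of the proposed shape (FooIO is hypothetical, not an existing
transform): the sink's expansion would emit one Void per window/pane once
that pane's data is durably written, and downstream steps consume it as an
ordinary PCollection, so triggering semantics carry through:

    // Hypothetical sink returning a completion signal.
    PCollection<Void> done = rows.apply(FooIO.write().to("some-destination"));

    // Anything that must run strictly after the write consumes the signal.
    done.apply(ParDo.of(new DoFn<Void, Void>() {
      @ProcessElement
      public void process(ProcessContext c) {
        // update a status table, send a notification, etc.
      }
    }));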
>>>>
>>>> That said, if somebody tried to implement this for some connectors (not
>>>> necessarily BigQuery) and pioneered the approach, it would be a great
>>>> contribution.
>>>>
>>>> On Sat, Sep 9, 2017 at 9:41 AM Steve Niemitz <[email protected]>
>>>> wrote:
>>>>
>>>>> I wonder if it makes sense to start simple and go from there. For
>>>>> example, I enhanced BigtableIO.Write to output the number of rows
>>>>> written in finishBundle(), simply into the global window with the
>>>>> current timestamp. This was more than enough to unblock me, but doesn't
>>>>> support more complicated scenarios with windowing.
>>>>>
>>>>> However, as I said, it was more than enough to solve the general batch
>>>>> use case, and I imagine it could be enhanced to support windowing by
>>>>> keeping track of which windows were written per bundle. (Can there even
>>>>> ever be more than one window per bundle?)
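
A rough sketch of that approach, assuming Beam's @FinishBundle /
FinishBundleContext API; the element type and the write itself are
placeholders:

    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.windowing.GlobalWindow;
    import org.joda.time.Instant;

    class CountingWriteFn extends DoFn<String, Long> {
      private long rowsWritten;

      @StartBundle
      public void startBundle() {
        rowsWritten = 0;
      }

      @ProcessElement
      public void process(ProcessContext c) {
        // perform the actual write of c.element() here, then count it
        rowsWritten++;
      }

      @FinishBundle
      public void finishBundle(FinishBundleContext c) {
        // Emit the per-bundle count into the global window at the current time.
        c.output(rowsWritten, Instant.now(), GlobalWindow.INSTANCE);
      }
    }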
>>>>>
>>>>> On Fri, Sep 8, 2017 at 2:32 PM, Eugene Kirpichov <[email protected]> wrote:
>>>>>
>>>>> > Hi,
>>>>> > I was going to implement this, but discussed it with +Reuven Lax
>>>>> > <[email protected]> and it appears to be quite difficult to do
>>>>> > properly, or even to define what it means at all, especially if
>>>>> > you're using the streaming inserts write method. So for now there is
>>>>> > no workaround except programmatically waiting for your whole pipeline
>>>>> > to finish (pipeline.run().waitUntilFinish()).
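
A minimal sketch of that workaround; everything after waitUntilFinish() is
plain client code outside Beam:

    import org.apache.beam.sdk.PipelineResult;

    PipelineResult result = pipeline.run();
    result.waitUntilFinish();  // blocks until the whole pipeline terminates
    // Only now is it safe to do follow-up work such as updating a status table.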
>>>>> >
>>>>> > On Fri, Sep 8, 2017 at 2:19 AM Chaim Turkel <[email protected]> wrote:
>>>>> >
>>>>> > > Is there a way around this for now?
>>>>> > > How can I get a snapshot version?
>>>>> > >
>>>>> > > chaim
>>>>> > >
>>>>> > > On Tue, Sep 5, 2017 at 8:48 AM, Eugene Kirpichov
>>>>> > > <[email protected]> wrote:
>>>>> > > > Oh I see! Okay, this should be easy to fix. I'll take a look.
>>>>> > > >
>>>>> > > > On Mon, Sep 4, 2017 at 10:23 PM Chaim Turkel <[email protected]> wrote:
>>>>> > > >
>>>>> > > >> WriteResult does not support apply -> that is the problem
>>>>> > > >>
>>>>> > > >> On Tue, Sep 5, 2017 at 4:59 AM, Eugene Kirpichov <[email protected]> wrote:
>>>>> > > >> > Hi,
>>>>> > > >> >
>>>>> > > >> > Sorry for the delay. So it sounds like you want to do something
>>>>> > > >> > after writing a window of data to BigQuery is complete.
>>>>> > > >> > I think this should be possible: expansion of BigQueryIO.write()
>>>>> > > >> > returns a WriteResult and you can apply other transforms to it.
>>>>> > > >> > Have you tried that?
>>>>> > > >> >
>>>>> > > >> > On Sat, Aug 26, 2017 at 1:10 PM Chaim Turkel <[email protected]> wrote:
>>>>> > > >> >
>>>>> > > >> >> I have documents from a mongo db that I need to migrate to
>>>>> > > >> >> bigquery. Since it is mongodb I do not know the schema ahead of
>>>>> > > >> >> time, so I have two pipelines: one to run over the documents and
>>>>> > > >> >> update the bigquery schema, then wait a few minutes (it can take
>>>>> > > >> >> a while for bigquery to be able to use the new schema), then
>>>>> > > >> >> with the other pipeline copy all the documents.
>>>>> > > >> >> To know where I got to with the different pipelines, I have a
>>>>> > > >> >> status table so that at the start I know from where to continue.
>>>>> > > >> >> So I need the option to update the status table with the success
>>>>> > > >> >> of the copy and some time value of the last copied document.
>>>>> > > >> >>
>>>>> > > >> >>
>>>>> > > >> >> chaim
>>>>> > > >> >>
>>>>> > > >> >> On Fri, Aug 25, 2017 at 6:53 PM, Eugene Kirpichov <[email protected]> wrote:
>>>>> > > >> >> > I'd like to know more about both of your use cases, can you
>>>>> > > >> >> > clarify? I think making sinks output something that can be
>>>>> > > >> >> > waited on by another pipeline step is a reasonable request,
>>>>> > > >> >> > but more details would help refine this suggestion.
>>>>> > > >> >> >
>>>>> > > >> >> > On Fri, Aug 25, 2017, 8:46 AM Chamikara Jayalath <[email protected]> wrote:
>>>>> > > >> >> >
>>>>> > > >> >> >> Can you do this from the program that runs the Beam job,
>>>>> > > >> >> >> after the job is complete (you might have to use a blocking
>>>>> > > >> >> >> runner or poll for the status of the job)?
>>>>> > > >> >> >>
>>>>> > > >> >> >> - Cham
>>>>> > > >> >> >>
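
A sketch of the polling variant Cham mentions; the interval is arbitrary
and State is part of the real PipelineResult API:

    import org.apache.beam.sdk.PipelineResult;

    static void waitForCompletion(PipelineResult result) throws InterruptedException {
      while (!result.getState().isTerminal()) {
        Thread.sleep(30_000);  // poll every 30 seconds
      }
      if (result.getState() == PipelineResult.State.DONE) {
        // the job succeeded; record completion, e.g. in a status table
      }
    }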
>>>>> > > >> >> >> On Fri, Aug 25, 2017 at 8:44 AM Steve Niemitz <[email protected]> wrote:
>>>>> > > >> >> >>
>>>>> > > >> >> >> > I also have a similar use case (but with BigTable) that I
>>>>> > > >> >> >> > feel like I had to hack up to make work. It'd be great to
>>>>> > > >> >> >> > hear if there is a way to do something like this already,
>>>>> > > >> >> >> > or if there are plans in the future.
>>>>> > > >> >> >> >
>>>>> > > >> >> >> > On Fri, Aug 25, 2017 at 9:46 AM, Chaim Turkel <[email protected]> wrote:
>>>>> > > >> >> >> >
>>>>> > > >> >> >> > > Hi,
>>>>> > > >> >> >> > >   I have a few pipelines that are an ETL from different
>>>>> > > >> >> >> > > systems to bigquery.
>>>>> > > >> >> >> > > I would like to write the status of the ETL after all
>>>>> > > >> >> >> > > records have been updated to bigquery.
>>>>> > > >> >> >> > > The problem is that writing to bigquery is a sink, and you
>>>>> > > >> >> >> > > cannot have any other steps after the sink.
>>>>> > > >> >> >> > > I tried a side output, but this is called with no
>>>>> > > >> >> >> > > correlation to the writing to bigquery, so I don't know if
>>>>> > > >> >> >> > > it succeeded or failed.
>>>>> > > >> >> >> > >
>>>>> > > >> >> >> > >
>>>>> > > >> >> >> > > any ideas?
>>>>> > > >> >> >> > > chaim
>>>>> > > >> >> >> > >
>>>>> > > >> >> >> >
>>>>> > > >> >> >>
>>>>> > > >> >>
>>>>> > > >>
>>>>> > >
>>>>> >
>>>>>
>>>>
>>>
