PR is out https://github.com/apache/beam/pull/4301

This should allow us to have useful sequencing for sinks like BigtableIO /
BigQueryIO.

Adding a couple of interested parties:
- Steve, would you be interested in using this in
https://github.com/apache/beam/pull/3997 ?
- Mairbek: this should help in https://github.com/apache/beam/pull/4264 -
in particular, it works properly when the input fires multiple times.
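
Usage ends up looking roughly like this (a minimal sketch; "signal" stands
for whatever per-window result collection the sink produces, and the
follow-up DoFn is illustrative):

  PCollection<String> main = ...;
  PCollection<?> signal = ...;  // e.g. write results emitted by the sink
  main.apply(Wait.on(signal))   // holds each window of "main" until the final
                                // pane of "signal" in that window has fired
      .apply(ParDo.of(new DoFn<String, Void>() {
        @ProcessElement
        public void process(ProcessContext c) {
          // Runs only after the corresponding writes are done.
        }
      }));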

On Tue, Dec 19, 2017 at 5:20 PM Eugene Kirpichov <kirpic...@google.com>
wrote:

> I figured out the Never.ever() approach and it seems to work. Will finish
> this up and send a PR at some point. Woohoo, thanks Kenn! Seems like this
> will be quite a useful transform.
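>
> (The gist: re-window the signal with a Never.ever() trigger, so the
> derived view produces its one and only pane when the window expires,
> i.e. not before the final pane of the signal has fired. A rough sketch,
> not necessarily the exact code in the PR:)
>
>   signal
>       .apply(Window.<SignalT>configure()       // SignalT: the signal's element type
>           .triggering(Never.ever())            // never fire early; only at window expiry
>           .withAllowedLateness(Duration.ZERO)  // sketch; the PR may preserve lateness
>           .discardingFiredPanes())
>       .apply(View.asList());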
>
> On Mon, Dec 18, 2017 at 1:23 PM Eugene Kirpichov <kirpic...@google.com>
> wrote:
>
>> I'm a bit confused by all of these suggestions: they sound plausible at a
>> high level, but I'm having a hard time making any one of them concrete.
>>
>> So suppose we want to create a transform Wait.on(PCollection<?> signal):
>> PCollection<T> -> PCollection<T>.
>> a.apply(Wait.on(sig)) returns a PCollection that is mostly identical to
>> "a", but buffers panes of "a" in any given window until the final pane of
>> "sig" in the same window is fired (or, if it's never fired, until the
>> window closes? could use a deadletter for that maybe).
>>
>> This transform I suppose would need to have a keyed and unkeyed version.
>>
>> The keyed version would support merging window fns, and would require "a"
>> and "sig" to be keyed by the same key, and would work using a CoGbk -
>> followed by a stateful ParDo? Or is there a way to get away without a
>> stateful ParDo here? (not all runners support it)
>>
>> The unkeyed version would not support merging window fns. Reuven, can you
>> elaborate how your combiner idea would work here - in particular, what do
>> you mean by "triggering only on the final pane"? Do you mean filter
>> non-final panes before entering the combiner? I wonder if that'll work,
>> probably worth a shot. And Kenn, can you elaborate on "re-trigger on the
>> side input with a Never.ever() trigger"?
>>
>> Thanks.
>>
>> On Sun, Dec 17, 2017 at 1:28 PM Reuven Lax <re...@google.com> wrote:
>>
>>> This is an interesting point.
>>>
>>> In the past, we've often just thought about sequencing some action to
>>> take place after the sink, in which case you can simply use the sink output
>>> as a main input. However if you want to run a transform with another
>>> PCollection as a main input, this doesn't work. And as you've discovered,
>>> triggered side inputs are defined to be non-deterministic, and there's no
>>> way to make things line up.
>>>
>>> What you're describing only makes sense if you're blocking against the
>>> final pane (since otherwise there's no reasonable way to match up somePc
>>> panes with the sink panes). There are multiple ways you can do this: one
>>> would be to CoGBK the two PCollections together, and trigger the new
>>> transform only on the final pane. Another would be to add a combiner that
>>> returns a Void, triggering only on the final pane, and then make this
>>> singleton Void a side input. You could also do something explicit with the
>>> state API.
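>>>
>>> (A rough sketch of the Void-combiner variant, with illustrative names;
>>> restricting it to the final pane is the part that needs the trigger:)
>>>
>>>   PCollectionView<Void> doneView = sinkResults
>>>       .apply(Combine.globally(new CombineFn<ResultT, Void, Void>() {
>>>         @Override public Void createAccumulator() { return null; }
>>>         @Override public Void addInput(Void acc, ResultT input) { return null; }
>>>         @Override public Void mergeAccumulators(Iterable<Void> accs) { return null; }
>>>         @Override public Void extractOutput(Void acc) { return null; }
>>>       }).asSingletonView());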
>>>
>>> Reuven
>>>
>>> On Fri, Dec 15, 2017 at 5:31 PM, Eugene Kirpichov <kirpic...@google.com>
>>> wrote:
>>>
>>>> So this appears not as easy as anticipated (surprise!)
>>>>
>>>> Suppose we have a PCollection "donePanes" with an element per
>>>> fully-processed pane: e.g. BigQuery sink, and elements saying "a pane of
>>>> data has been written; this pane is: final / non-final".
>>>>
>>>> Suppose we want to use this to ensure that somePc.apply(ParDo.of(fn))
>>>> happens only after the final pane has been written.
>>>>
>>>> In other words: we want a.apply(ParDo.of(b).withSideInputs(c)) to happen
>>>> when c emits a *final* pane.
>>>>
>>>> Unfortunately, using
>>>> ParDo.of(fn).withSideInputs(donePanes.apply(View.asSingleton())) doesn't do
>>>> the trick: the side input becomes ready the moment *the first* pane of
>>>> data has been written.
>>>>
>>>> But neither does ParDo.of(fn).withSideInputs(donePanes.apply(...filter
>>>> only final panes...).apply(View.asSingleton())). It also becomes ready the
>>>> moment *the first* pane has been written, you just get an exception if
>>>> you access the side input before the *final* pane was written.
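>>>>
>>>> (Concretely, the second attempt was along these lines; WriteDone is an
>>>> illustrative element type:)
>>>>
>>>>   PCollectionView<WriteDone> doneView = donePanes
>>>>       .apply(ParDo.of(new DoFn<WriteDone, WriteDone>() {
>>>>         @ProcessElement
>>>>         public void process(ProcessContext c) {
>>>>           if (c.pane().isLast()) {  // keep only the final pane
>>>>             c.output(c.element());
>>>>           }
>>>>         }
>>>>       }))
>>>>       .apply(View.<WriteDone>asSingleton());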
>>>>
>>>> I can't think of a pure-Beam solution to this: either "donePanes" will
>>>> be used as a main input to something (and then everything else can only be
>>>> a side input, which is not general enough), or it will be used as a side
>>>> input (and then we can't achieve "trigger only after the final pane 
>>>> fires").
>>>>
>>>> It seems that we need a way to control the side input pushback, and
>>>> configure whether a view becomes ready when its first pane has fired or
>>>> when its last pane has fired. I could see this being a property on the View
>>>> transform itself. In terms of implementation - I tried to figure out how
>>>> side input readiness is determined, in the direct runner and Dataflow
>>>> runner, and I'm completely lost and would appreciate some help.
>>>>
>>>> On Thu, Dec 7, 2017 at 12:01 AM Reuven Lax <re...@google.com> wrote:
>>>>
>>>>> This sounds great!
>>>>>
>>>>> On Mon, Dec 4, 2017 at 4:34 PM, Ben Chambers <bchamb...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> This would be absolutely great! It seems somewhat similar to the
>>>>>> changes that were made to the BigQuery sink to support WriteResult (
>>>>>> https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigquery/WriteResult.java
>>>>>> ).
>>>>>>
>>>>>> I find it helpful to think about the different things that may come
>>>>>> after a sink. For instance:
>>>>>>
>>>>>> 1. It might be helpful to have a collection of the failed input
>>>>>> elements. The type of failed elements is pretty straightforward -- just 
>>>>>> the
>>>>>> input elements. This allows handling such failures by directing them
>>>>>> elsewhere or performing additional processing.
>>>>>>
>>>>>
>>>>> BigQueryIO already does this as you point out.
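>>>>>
>>>>> For instance (a sketch; the streaming-inserts path exposes this today):
>>>>>
>>>>>   WriteResult result = rows.apply(BigQueryIO.writeTableRows().to(table));
>>>>>   PCollection<TableRow> failed = result.getFailedInserts();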
>>>>>
>>>>>>
>>>>>> 2. For a sink that produces a series of files, it might be useful to
>>>>>> have a collection of the file names that have been completely written. 
>>>>>> This
>>>>>> allows performing additional handling on these completed segments.
>>>>>>
>>>>>
>>>>> In fact we already do this for FileBasedSinks. See
>>>>> https://github.com/apache/beam/blob/7d53878768757ef2115170a5073b99956e924ff2/sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFilesResult.java
>>>>>
>>>>>>
>>>>>> 3. For a sink that updates some destination, it would be reasonable
>>>>>> to have a collection that provides (periodically) output indicating how
>>>>>> complete the information written to that destination is. For instance, 
>>>>>> this
>>>>>> might be something like "<this bigquery table> has all of the elements up
>>>>>> to <input watermark>". This allows tracking how much information
>>>>>> has been completely written out.
>>>>>>
>>>>>
>>>>> Interesting. Maybe tough to do since sinks often don't have that
>>>>> knowledge.
>>>>>
>>>>>
>>>>>>
>>>>>> I think those concepts map to the more detailed description Eugene
>>>>>> provided, but I find it helpful to focus on what information comes out of
>>>>>> the sink and how it might be used.
>>>>>>
>>>>>> Were there any use cases the above miss? Any functionality that has
>>>>>> been described that doesn't map to these use cases?
>>>>>>
>>>>>> -- Ben
>>>>>>
>>>>>> On Mon, Dec 4, 2017 at 4:02 PM Eugene Kirpichov <kirpic...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> It makes sense to consider how this maps onto existing kinds of
>>>>>>> sinks.
>>>>>>>
>>>>>>> E.g.:
>>>>>>> - Something that just makes an RPC per record, e.g. MqttIO.write():
>>>>>>> that will emit 1 result per bundle (either a bogus value or the number
>>>>>>> of records written) that will be Combine'd into 1 result per pane of
>>>>>>> input (see the sketch after this list). A user can sequence against
>>>>>>> this and be notified when some intermediate amount of data has been
>>>>>>> written for a window, or (via .isFinal()) when all of it has been
>>>>>>> written.
>>>>>>> - Something that e.g. initiates an import job, such as
>>>>>>> BigQueryIO.write(), or an ElasticsearchIO write with a follow-up atomic
>>>>>>> index swap: should emit 1 result per import job, e.g. containing
>>>>>>> information about the job (e.g. its id and statistics). Role of panes is
>>>>>>> the same.
>>>>>>> - Something like above but that supports dynamic destinations: like
>>>>>>> in WriteFiles, result will be PCollection<KV<DestinationT, ResultT>> 
>>>>>>> where
>>>>>>> ResultT may be something like a list of files that were written for this
>>>>>>> pane of this destination.
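>>>>>>>
>>>>>>> (For the first case, the per-bundle-to-per-pane step is just a global
>>>>>>> combine without defaults; a sketch, where perBundleCounts stands for
>>>>>>> the sink's raw per-bundle output:)
>>>>>>>
>>>>>>>   PCollection<Long> resultPerPane =
>>>>>>>       perBundleCounts.apply(Sum.longsGlobally().withoutDefaults());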
>>>>>>>
>>>>>>> On Mon, Dec 4, 2017 at 3:58 PM Eugene Kirpichov <
>>>>>>> kirpic...@google.com> wrote:
>>>>>>>
>>>>>>>> I agree that the proper API for enabling the use case "do something
>>>>>>>> after the data has been written" is to return a PCollection of objects
>>>>>>>> where each object represents the result of writing some identifiable 
>>>>>>>> subset
>>>>>>>> of the data. Then one can apply a ParDo to this PCollection, in order 
>>>>>>>> to
>>>>>>>> "do something after this subset has been written".
>>>>>>>>
>>>>>>>> The challenging part here is *identifying* the subset of the data
>>>>>>>> that's been written, in a way consistent with Beam's unified
>>>>>>>> batch/streaming model, where saying "all data has been written" is not 
>>>>>>>> an
>>>>>>>> option because more data can arrive.
>>>>>>>>
>>>>>>>> The next choice is "a window of input has been written", but then
>>>>>>>> again, late data can arrive into a window as well.
>>>>>>>>
>>>>>>>> Next choice after that is "a pane of input has been written", but
>>>>>>>> per https://s.apache.org/beam-sink-triggers the term "pane of
>>>>>>>> input" is moot: triggering and panes should be something private to the
>>>>>>>> sink, and the same input can trigger different sinks differently. The
>>>>>>>> hypothetical different accumulation modes make this trickier still. 
>>>>>>>> I'm not
>>>>>>>> sure whether we intend to also challenge the idea that windowing is
>>>>>>>> inherent to the collection, or whether it too should be specified on a
>>>>>>>> transform that processes the collection. I think for the sake of this
>>>>>>>> discussion we can assume that it's inherent, and assume the mental 
>>>>>>>> model
>>>>>>>> that the elements in different windows of a PCollection are processed
>>>>>>>> independently - "as if" there were multiple pipelines processing each
>>>>>>>> window.
>>>>>>>>
>>>>>>>> Overall, embracing the full picture, we end up with something like
>>>>>>>> this:
>>>>>>>> - The input PCollection is a composition of windows.
>>>>>>>> - If the windowing strategy is non-merging (e.g. fixed or sliding
>>>>>>>> windows), the below applies to the entire contents of the PCollection. 
>>>>>>>> If
>>>>>>>> it's merging (e.g. session windows), then it applies per-key, and the 
>>>>>>>> input
>>>>>>>> should be (perhaps implicitly) keyed in a way that the sink 
>>>>>>>> understands -
>>>>>>>> for example, the grouping by destination in DynamicDestinations in 
>>>>>>>> file and
>>>>>>>> bigquery writes.
>>>>>>>> - Each window's contents is a "changelog" - stream of elements and
>>>>>>>> retractions.
>>>>>>>> - A "sink" processes each window of the collection, deciding how to
>>>>>>>> handle elements and retractions (and whether to support retractions at 
>>>>>>>> all)
>>>>>>>> in a sink-specific way, and deciding *when* to perform the side 
>>>>>>>> effects for
>>>>>>>> a portion of the changelog (a "pane") based on the sink's triggering
>>>>>>>> strategy.
>>>>>>>> - If the side effect itself is parallelized, then there'll be
>>>>>>>> multiple results for the pane - e.g. one per bundle.
>>>>>>>> - Each (sink-chosen) pane produces a set of results, e.g. a list of
>>>>>>>> filenames that have been written, or simply a number of records that 
>>>>>>>> was
>>>>>>>> written, or a bogus void value etc. The result will implicitly include 
>>>>>>>> the
>>>>>>>> window of the input it's associated with. It will also implicitly 
>>>>>>>> include
>>>>>>>> pane information - index of the pane in this window, and whether this 
>>>>>>>> is
>>>>>>>> the first or last pane.
>>>>>>>> - The partitioning into bundles is an implementation detail and not
>>>>>>>> very useful, so before presenting the pane write results to the user, 
>>>>>>>> the
>>>>>>>> sink will probably want to Combine the bundle results so that there 
>>>>>>>> ends up
>>>>>>>> being 1 value for each pane that was written. Once again note that 
>>>>>>>> panes
>>>>>>>> may be associated with windows of the input as a whole, but if the 
>>>>>>>> input is
>>>>>>>> keyed (like with DynamicDestinations) they'll be associated with 
>>>>>>>> per-key
>>>>>>>> subsets of windows of the input.
>>>>>>>> - This combining requires an extra, well, combining operation, so
>>>>>>>> it should be optional.
>>>>>>>> - The user will end up getting either a PCollection<ResultT> or a
>>>>>>>> PCollection<KV<KeyT, ResultT>>, for sink-specific KeyT and ResultT, 
>>>>>>>> where
>>>>>>>> the elements of this collection will implicitly have window and pane
>>>>>>>> information, available via the implicit BoundedWindow and PaneInfo.
>>>>>>>> - Until "sink triggering" is implemented, we'll have to embrace the
>>>>>>>> fact that trigger strategy is set on the input. But in that case the 
>>>>>>>> user
>>>>>>>> will have to accept that the PaneInfo of ResultT's is not necessarily
>>>>>>>> directly related to panes of the input - the sink is allowed to do 
>>>>>>>> internal
>>>>>>>> aggregation as an implementation detail, which may modify the 
>>>>>>>> triggering
>>>>>>>> strategy. Basically the user will still get sink-assigned panes.
>>>>>>>> - In most cases, one may imagine that the user is interested in
>>>>>>>> being notified of "no more data associated with this window will be
>>>>>>>> written", so the user will ignore all ResultT's except those where the 
>>>>>>>> pane
>>>>>>>> is marked final. If a user is interested in being notified of 
>>>>>>>> intermediate
>>>>>>>> write results - they'll have to embrace the fact that they cannot 
>>>>>>>> identify
>>>>>>>> the precise subset of input associated with the intermediate result.
>>>>>>>>
>>>>>>>> I think the key points of the above are:
>>>>>>>> - Sinks should support windowed input. Sinks should write different
>>>>>>>> windows of input independently. If the sink can write multi-destination
>>>>>>>> input, the destination should function as a grouping key, and in that 
>>>>>>>> case
>>>>>>>> merging windowing should be allowed.
>>>>>>>> - Producing a PCollection of write results should be optional.
>>>>>>>> - When asked to produce results, sinks produce a PCollection of
>>>>>>>> results that may be keyed or unkeyed (per above), and are placed in the
>>>>>>>> window of the input that was written, and have a PaneInfo assigned by 
>>>>>>>> the
>>>>>>>> sink, of which probably the only part useful to the user is whether 
>>>>>>>> it's
>>>>>>>> .isFinal().
>>>>>>>>
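>>>>>>>> In code, the user-facing shape might be roughly this (a sketch;
>>>>>>>> SomeSink, DestinationT and ResultT are placeholders):
>>>>>>>>
>>>>>>>>   PCollection<KV<DestinationT, ResultT>> results =
>>>>>>>>       input.apply(SomeSink.write().withResults());
>>>>>>>>   results.apply(ParDo.of(new DoFn<KV<DestinationT, ResultT>, Void>() {
>>>>>>>>     @ProcessElement
>>>>>>>>     public void process(ProcessContext c, BoundedWindow window) {
>>>>>>>>       // PaneInfo.isLast() is the ".isFinal()" check referred to above.
>>>>>>>>       if (c.pane().isLast()) {
>>>>>>>>         // Everything for this destination in this window has been written.
>>>>>>>>       }
>>>>>>>>     }
>>>>>>>>   }));
>>>>>>>>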
>>>>>>>> Does this sound reasonable?
>>>>>>>>
>>>>>>>> On Mon, Dec 4, 2017 at 11:50 AM Robert Bradshaw <
>>>>>>>> rober...@google.com> wrote:
>>>>>>>>
>>>>>>>>> +1
>>>>>>>>>
>>>>>>>>> At the very least an empty PCollection<?> could be produced with no
>>>>>>>>> promises about its contents but the ability to be followed (e.g.
>>>>>>>>> as a
>>>>>>>>> side input), which is forward compatible with whatever actual
>>>>>>>>> metadata
>>>>>>>>> one may decide to produce in the future.
>>>>>>>>>
>>>>>>>>> On Mon, Dec 4, 2017 at 11:06 AM, Kenneth Knowles <k...@google.com>
>>>>>>>>> wrote:
>>>>>>>>> > +dev@
>>>>>>>>> >
>>>>>>>>> > I am in complete agreement with Luke. Data dependencies are easy to
>>>>>>>>> > understand and a good way for an IO to communicate and establish
>>>>>>>>> > causal dependencies. Converting an IO from PDone to real output may
>>>>>>>>> > spur further useful thoughts based on the design decisions about what
>>>>>>>>> > sort of output is most useful.
>>>>>>>>> >
>>>>>>>>> > Kenn
>>>>>>>>> >
>>>>>>>>> > On Mon, Dec 4, 2017 at 10:42 AM, Lukasz Cwik <lc...@google.com>
>>>>>>>>> > wrote:
>>>>>>>>> >>
>>>>>>>>> >> I think all sinks actually do have valuable information to output
>>>>>>>>> >> which can be used after a write (file names, transaction/commit/row
>>>>>>>>> >> ids, table names, ...). In addition to this metadata, having a
>>>>>>>>> >> PCollection of all successful writes and all failed writes is useful
>>>>>>>>> >> for users, so they can chain an action which depends on what was or
>>>>>>>>> >> wasn't successfully written. Users have requested adding
>>>>>>>>> >> retry/failure handling policies to sinks so that failed writes don't
>>>>>>>>> >> jam up the pipeline.
>>>>>>>>> >>
>>>>>>>>> >> On Fri, Dec 1, 2017 at 2:43 PM, Chet Aldrich <chet.aldr...@postmates.com>
>>>>>>>>> >> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> So I agree generally with the idea that returning a PCollection
>>>>>>>>> >>> makes all of this easier, so that arbitrary additional functions
>>>>>>>>> >>> can be added. But what exactly would write functions return in a
>>>>>>>>> >>> PCollection that would make sense? The whole idea is that we've
>>>>>>>>> >>> written to an external source and now the collection itself is no
>>>>>>>>> >>> longer needed.
>>>>>>>>> >>>
>>>>>>>>> >>> Currently, that's represented with a PDone, which doesn't allow any
>>>>>>>>> >>> work to occur after it. I see a couple of possible ways of handling
>>>>>>>>> >>> this given this conversation, and am curious which sounds like the
>>>>>>>>> >>> best way to deal with the problem:
>>>>>>>>> >>>
>>>>>>>>> >>> 1. Have output transforms always return something specific (the
>>>>>>>>> >>> same across transforms by convention), in the form of a
>>>>>>>>> >>> PCollection, so operations can occur after it.
>>>>>>>>> >>>
>>>>>>>>> >>> 2. Make either PDone or some new type act as a PCollection, so we
>>>>>>>>> >>> can run applies afterward.
>>>>>>>>> >>>
>>>>>>>>> >>> 3. Make output transforms provide the facility for a callback
>>>>>>>>> >>> function which runs after the transform is complete.
>>>>>>>>> >>>
>>>>>>>>> >>> I went through these gymnastics recently when I was trying to build
>>>>>>>>> >>> something that would move indices after writing to Algolia, and the
>>>>>>>>> >>> solution was to co-opt code from the old Sink class that used to
>>>>>>>>> >>> exist in Beam. The problem is that that particular method requires
>>>>>>>>> >>> the output transform in question to return a PCollection, even if
>>>>>>>>> >>> it is trivial or doesn't make sense to return one. This seems like
>>>>>>>>> >>> a bad solution, but unfortunately there isn't a notion of a
>>>>>>>>> >>> transform that has no explicit output but needs to have operations
>>>>>>>>> >>> occur after it.
>>>>>>>>> >>>
>>>>>>>>> >>> The three potential solutions above address this issue, but I would
>>>>>>>>> >>> like to hear which would be preferable (or perhaps a different
>>>>>>>>> >>> proposal altogether?). Perhaps we could also start a ticket on
>>>>>>>>> >>> this, since it seems like a worthwhile feature addition. I would
>>>>>>>>> >>> find it really useful, for one.
>>>>>>>>> >>>
>>>>>>>>> >>> Chet
>>>>>>>>> >>>
>>>>>>>>> >>> On Dec 1, 2017, at 12:19 PM, Lukasz Cwik <lc...@google.com>
>>>>>>>>> >>> wrote:
>>>>>>>>> >>>
>>>>>>>>> >>> Instead of a callback fn, it's most useful if a PCollection is
>>>>>>>>> >>> returned containing the result of the sink, so that any arbitrary
>>>>>>>>> >>> additional functions can be applied.
>>>>>>>>> >>>
>>>>>>>>> >>> On Fri, Dec 1, 2017 at 7:14 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>>>>>>>>> >>> wrote:
>>>>>>>>> >>>>
>>>>>>>>> >>>> Agreed, I would prefer to do the callback in the IO rather than in
>>>>>>>>> >>>> the main method.
>>>>>>>>> >>>>
>>>>>>>>> >>>> Regards
>>>>>>>>> >>>> JB
>>>>>>>>> >>>>
>>>>>>>>> >>>> On 12/01/2017 03:54 PM, Steve Niemitz wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> I do something almost exactly like this, but with BigtableIO
>>>>>>>>> >>>>> instead. I have a pull request open here [1] (which reminds me I
>>>>>>>>> >>>>> need to finish this up...). It would really be nice for most IOs
>>>>>>>>> >>>>> to support something like this.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> Essentially you do a GroupByKey (or some CombineFn) on the output
>>>>>>>>> >>>>> from the BigtableIO, and then feed that into your function, which
>>>>>>>>> >>>>> will run when all writes finish.
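>>>>>>>>> >>>>>
>>>>>>>>> >>>>> (Roughly, as a sketch; RunAfterWritesFn stands in for whatever
>>>>>>>>> >>>>> should run afterwards, writeResults for the IO's output, and
>>>>>>>>> >>>>> Result for its element type:)
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>     writeResults
>>>>>>>>> >>>>>         .apply(Combine.globally(Count.<Result>combineFn())
>>>>>>>>> >>>>>             .withoutDefaults())
>>>>>>>>> >>>>>         .apply(ParDo.of(new RunAfterWritesFn()));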
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> You probably want to avoid doing something in the main method
>>>>>>>>> >>>>> because there's no guarantee it'll actually run (maybe the driver
>>>>>>>>> >>>>> will die, get killed, the machine will explode, etc.).
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> [1] https://github.com/apache/beam/pull/3997
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> On Fri, Dec 1, 2017 at 9:46 AM, NerdyNick <nerdyn...@gmail.com>
>>>>>>>>> >>>>> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>     Assuming you're in Java, you could just follow on in your
>>>>>>>>> >>>>>     main method, checking the state of the result.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>     Example:
>>>>>>>>> >>>>>     PipelineResult result = pipeline.run();
>>>>>>>>> >>>>>     try {
>>>>>>>>> >>>>>       result.waitUntilFinish();
>>>>>>>>> >>>>>       if (result.getState() == PipelineResult.State.DONE) {
>>>>>>>>> >>>>>         // Do the ES work here.
>>>>>>>>> >>>>>       }
>>>>>>>>> >>>>>     } catch (Exception e) {
>>>>>>>>> >>>>>       result.cancel();
>>>>>>>>> >>>>>       throw e;
>>>>>>>>> >>>>>     }
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>     Otherwise you could also use Oozie to construct a workflow.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>     On Fri, Dec 1, 2017 at 2:02 AM, Jean-Baptiste Onofré
>>>>>>>>> >>>>>     <j...@nanthrax.net> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         Hi,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         yes, we had a similar question some days ago.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         We can imagine having a user callback fn fired when the
>>>>>>>>> >>>>>         sink batch is complete.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         Let me think about that.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         Regards
>>>>>>>>> >>>>>         JB
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         On 12/01/2017 09:04 AM, Philip Chan wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>             Hey JB,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>             Thanks for getting back so quickly.
>>>>>>>>> >>>>>             I suppose in that case I would need a way of
>>>>>>>>> >>>>>             monitoring when the ES transform completes
>>>>>>>>> >>>>>             successfully before I can proceed with doing the swap.
>>>>>>>>> >>>>>             The problem with this is that I can't think of a good
>>>>>>>>> >>>>>             way to determine that termination state short of
>>>>>>>>> >>>>>             polling the new index to check the document count
>>>>>>>>> >>>>>             compared to the size of the input PCollection.
>>>>>>>>> >>>>>             That, or maybe I'd need to use an external system
>>>>>>>>> >>>>>             like you mentioned to poll on the state of the
>>>>>>>>> >>>>>             pipeline (I'm using Google Dataflow, so maybe there's
>>>>>>>>> >>>>>             a way to do this with some API).
>>>>>>>>> >>>>>             But I would have thought that there would be an easy
>>>>>>>>> >>>>>             way of simply saying "do not process this transform
>>>>>>>>> >>>>>             until this other transform completes".
>>>>>>>>> >>>>>             Is there no established way of "signaling" between
>>>>>>>>> >>>>>             pipelines when some pipeline completes, or some way
>>>>>>>>> >>>>>             of declaring a dependency of 1 transform on another
>>>>>>>>> >>>>>             transform?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>             Thanks again,
>>>>>>>>> >>>>>             Philip
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>             On Thu, Nov 30, 2017 at 11:44 PM, Jean-Baptiste
>>>>>>>>> >>>>>             Onofré <j...@nanthrax.net> wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                  Hi Philip,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                  You won't be able to do (3) in the same
>>>>>>>>> >>>>>                  pipeline, as the Elasticsearch sink PTransform
>>>>>>>>> >>>>>                  ends the pipeline with PDone.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                  So, (3) has to be done in another pipeline
>>>>>>>>> >>>>>                  (using a DoFn) or in another "system" (like
>>>>>>>>> >>>>>                  Camel, for instance). I would do a check of the
>>>>>>>>> >>>>>                  data in the index and then trigger the swap
>>>>>>>>> >>>>>                  there.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                  Regards
>>>>>>>>> >>>>>                  JB
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                  On 12/01/2017 08:41 AM, Philip Chan wrote:
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                      Hi,
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                      I'm pretty new to Beam, and I've been trying
>>>>>>>>> >>>>>                      to use the ElasticsearchIO sink to write
>>>>>>>>> >>>>>                      docs into ES. With this, I want to be able
>>>>>>>>> >>>>>                      to:
>>>>>>>>> >>>>>                      1. ingest and transform rows from a DB
>>>>>>>>> >>>>>                      (done)
>>>>>>>>> >>>>>                      2. write JSON docs/strings into a new ES
>>>>>>>>> >>>>>                      index (done)
>>>>>>>>> >>>>>                      3. after (2) is complete and all documents
>>>>>>>>> >>>>>                      are written into a new index, trigger an
>>>>>>>>> >>>>>                      atomic index swap under an alias to replace
>>>>>>>>> >>>>>                      the current aliased index with the new index
>>>>>>>>> >>>>>                      generated in step 2. This is basically a
>>>>>>>>> >>>>>                      single POST request to the ES cluster.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                      The problem I'm facing is that I don't seem
>>>>>>>>> >>>>>                      to be able to find a way for (3) to happen
>>>>>>>>> >>>>>                      after step (2) is complete.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                      The ElasticsearchIO.Write transform returns
>>>>>>>>> >>>>>                      a PDone, and I'm not sure how to proceed
>>>>>>>>> >>>>>                      from there because it doesn't seem to let me
>>>>>>>>> >>>>>                      do another apply on it to "define" a
>>>>>>>>> >>>>>                      dependency.
>>>>>>>>> >>>>>
>>>>>>>>> >>>>> https://beam.apache.org/documentation/sdks/javadoc/2.1.0/org/apache/beam/sdk/io/elasticsearch/ElasticsearchIO.Write.html
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                      Is there a recommended way to construct
>>>>>>>>> >>>>>                      pipeline workflows like this?
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                      Thanks in advance,
>>>>>>>>> >>>>>                      Philip
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>                  --
>>>>>>>>> >>>>>                  Jean-Baptiste Onofré
>>>>>>>>> >>>>>                  jbono...@apache.org
>>>>>>>>> >>>>>                  http://blog.nanthrax.net
>>>>>>>>> >>>>>                  Talend - http://www.talend.com
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>         --
>>>>>>>>> >>>>>         Jean-Baptiste Onofré
>>>>>>>>> >>>>>         jbono...@apache.org
>>>>>>>>> >>>>>         http://blog.nanthrax.net
>>>>>>>>> >>>>>         Talend - http://www.talend.com
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>     --
>>>>>>>>> >>>>>     Nick Verbeck - NerdyNick
>>>>>>>>> >>>>>     ----------------------------------------------------
>>>>>>>>> >>>>>     NerdyNick.com
>>>>>>>>> >>>>>     TrailsOffroad.com
>>>>>>>>> >>>>>     NoKnownBoundaries.com
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>>
>>>>>>>>> >>>>
>>>>>>>>> >>>> --
>>>>>>>>> >>>> Jean-Baptiste Onofré
>>>>>>>>> >>>> jbono...@apache.org
>>>>>>>>> >>>> http://blog.nanthrax.net
>>>>>>>>> >>>> Talend - http://www.talend.com
>>>>>>>>> >>>
>>>>>>>>> >>>
>>>>>>>>> >>>
>>>>>>>>> >>
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>
>>>
