Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Kenneth Knowles
To me (6) PCollection<SourceSpecificWriteResult> feels like an obvious
choice, and is actually the same as (4) at the core, as I think has pretty
much been said. Basically, the type actually describes what is going on.
Whether there's one or more PCollections doesn't really matter. All of the other
options seem strange to me, though I trust the authors had design reasons
that drove them to their choices, that we should anyhow respect and
consider. But using a worse type to make it easier to change doesn't make
sense to me. If we are in that unstable state, we should communicate it as
experimental, and there are easy backwards-compatible alternatives.

On Thu, Jun 27, 2019 at 9:32 AM Reuven Lax wrote:

> This is a good question, because many sinks are logically _not_ windowed.
> They aren't producing aggregations, so logically they are often treated as
> if they are in the global window (and many internally window into the global
> window first thing).
>

I would go farther and say that this applies to *all* sinks unless there's
a very special design consideration. When data exits the pipeline it is by
definition in the global window, even if there is windowing internally to
the write transform. "Windowed writes" is a particular way of _converting_
data into the global window.

The event time of the output from a write transform might naturally
describe either of: (a) the time at which the write took place, or (b) some
summary of the events that were written. That output can therefore be
windowed. It is an interesting design decision how to balance these two.
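
As a rough illustration of the two timestamp choices, here is a minimal
plain-Python sketch. It uses no Beam APIs; `WrittenBatch`,
`write_time_timestamp`, `event_summary_timestamp` and `fixed_window_start`
are hypothetical names, and it assumes the "summary" in (b) is the latest
event time in the batch, which is only one possible choice.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class WrittenBatch:
    """A batch of (value, event_time_seconds) pairs flushed in one write."""
    elements: Tuple[Tuple[str, float], ...]
    flushed_at: float  # time at which the write took place


def write_time_timestamp(batch: WrittenBatch) -> float:
    # Choice (a): stamp the result with the time of the write itself.
    return batch.flushed_at


def event_summary_timestamp(batch: WrittenBatch) -> float:
    # Choice (b): stamp the result with a summary of the written events.
    # Assumption: "summary" here means the latest event time in the batch.
    return max(ts for _, ts in batch.elements)


def fixed_window_start(timestamp: float, size: float) -> float:
    # Start of the fixed window of the given size that contains the timestamp.
    return timestamp - (timestamp % size)


batch = WrittenBatch(elements=(("a", 12.0), ("b", 47.0)), flushed_at=130.0)
```

With 60-second fixed windows, the result of this batch would land in the
window starting at 120 under (a) and in the window starting at 0 under (b),
so downstream windowing of the write result depends on which one is chosen.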

Kenn


> Wait is a nice transform that reuses existing windowing, but I wonder if
> there's another way to model this without relying on windowing.
> Essentially you want a way to track element provenance - when all results
> from a single element are flushed through another transform, then trigger a
> second transform. Element provenance is interesting in many other use cases
> as well (e.g. debugging: given an output element, what input elements
> caused it?). Maybe there's a more direct way to model this problem without
> trying to use windowing to track causality?
>
> Reuven
>
> On Thu, Jun 27, 2019 at 4:38 AM Reza Rokni  wrote:
>
>> The use case of a transform waiting for a Sink or Sinks to complete is
>> very interesting indeed!
>>
>> Curious, if a sink internally makes use of a Global Window with
>> processing time triggers to push its writes, what mechanism could be used
>> to release a transform waiting for a signal from the Sink(s) that all
>> processing is done and it can move forward?
>>
>> On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw 
>> wrote:
>>
>>> Regarding Python, yes and no. Python doesn't distinguish at compile
>>> time between (1), (2), and (6), but that doesn't mean it isn't part of
>>> the public API and people might start counting on it, so it's in some
>>> sense worse. We can also do (3) (which is less cumbersome in Python,
>>> either returning a tuple or a dict) or (4).
>>>
>>> Good point about providing a simple solution (something that can be
>>> waited on at least) and allowing for with* modifiers to return more.
>>>
>>> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath 
>>> wrote:
>>> >
>>> > BTW regarding Python SDK, I think the answer to this question is
>>> > simpler for Python SDK due to the lack of types. Most examples I know just
>>> > return a PCollection from the Write transform which may or may not be
>>> > ignored by users. If the PCollection is used, the user should be aware of
>>> > the element type of the returned PCollection and should use it accordingly
>>> > in subsequent transforms.
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw 
>>> wrote:
>>> >>>
>>> >>> Good question.
>>> >>>
>>> >>> I'm not sure what could be done with (5) if it contains no deferred
>>> >>> objects (e.g there's nothing to wait on).
>>> >>>
>>> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
>>> >>> advantage of (2) is that one can migrate to (1) or (6) without
>>> >>> changing the public API, while giving something to wait on without
>>> >>> promising anything about its contents.
>>> >>>
>>> >>>
>>> >>> I would probably lean towards (4) for anything that would want to
>>> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
>>> >>> and view (3) as being a "cheap" but more cumbersome for the user way
>>> >>> of writing (4). In both cases, more information can be added in a
>>> >>> forward-compatible way. Technically (4) could extend (3) if one wants
>>> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
>>> >>> an aside, it would be interesting if any of the schema work that lets
>>> >>> us get rid of tuple tags for elements (e.g. join operations) could let
>>> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
>>> >>> with PCollection members be as powerful as a 

Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Lukasz Cwik
In the POutput case (4), does that mean we will have to compute all those
outputs in the transform even if they aren't used?

If yes, I prefer (6) because it allows for the transform structure to be
modified to produce these additional outputs only if they will be
consumed instead of having them produced all the time.
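
To make the "only if they will be consumed" idea concrete, here is a
minimal plain-Python sketch of a with*-style modifier. It is a toy model
and not Beam code: `ToyWrite`, `with_failed_rows`, `expand` and
`accept_non_empty` are hypothetical names, and plain lists stand in for
PCollections.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ToyWrite:
    """Hypothetical write builder; not a Beam transform."""
    sink: Callable[[str], bool]      # returns True when a row was written
    emit_failed_rows: bool = False   # the extra output is off by default

    def with_failed_rows(self) -> "ToyWrite":
        # A with*-style modifier: returns a copy that also produces failures.
        return replace(self, emit_failed_rows=True)

    def expand(self, rows: List[str]) -> Dict[str, List[str]]:
        written: List[str] = []
        failed: List[str] = []
        for row in rows:
            if self.sink(row):
                written.append(row)
            elif self.emit_failed_rows:
                # Only collected when the caller asked for this output.
                failed.append(row)
        outputs = {"written": written}
        if self.emit_failed_rows:
            outputs["failed"] = failed
        return outputs


def accept_non_empty(row: str) -> bool:
    # Stand-in sink: pretend that empty rows fail to write.
    return bool(row)


default_outputs = ToyWrite(accept_non_empty).expand(["a", "", "b"])
detailed_outputs = (
    ToyWrite(accept_non_empty).with_failed_rows().expand(["a", "", "b"])
)
```

In this sketch the default write yields only the "written" output, and the
"failed" output exists only after the caller opts in through the modifier.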

On Mon, Jul 15, 2019 at 11:20 AM Ismaël Mejía wrote:

> Just wanted to bring back the conversation on this subject. A quick
> abstract of the discussion so far:
>
> We are trying to agree on the best approach for return types in Write
> transforms towards some sort of ‘homogenization’ in IOs.
> At the moment we mostly agree that the best approaches for return types
> in Writes are:
>
> 4. Write returns ‘a class that implements POutput’
> 6. Write returns `PCollection<SourceSpecificWriteResult>`
>
> There are still some details to discuss:
>
> 1. Is (4) somehow less composable than (6) ?
> 2. Could it make sense (in the API sense) that `SourceSpecificWriteResult`
> contains a PCollection too as an attribute to cover the ‘tuple’ return
> type issue?
>
> In his last email Reuven mentioned some extra points that could change
> the direction towards one of the options. Does anyone else have comments /
> more ideas we may be missing?
>
> On Thu, Jun 27, 2019 at 6:32 PM Reuven Lax  wrote:
> >
> > This is a good question, because many sinks are logically _not_
> > windowed. They aren't producing aggregations, so logically they are often
> > treated as if they are in the global window (and many internally window into
> > the global window first thing).
> >
> > Wait is a nice transform that reuses existing windowing, but I wonder if
> > there's another way to model this without relying on windowing. Essentially
> > you want a way to track element provenance - when all results from a single
> > element are flushed through another transform, then trigger a second
> > transform. Element provenance is interesting in many other use cases as
> > well (e.g. debugging: given an output element, what input elements caused
> > it?). Maybe there's a more direct way to model this problem without trying
> > to use windowing to track causality?
> >
> > Reuven
> >
> > On Thu, Jun 27, 2019 at 4:38 AM Reza Rokni  wrote:
> >>
> >> The use case of a transform waiting for a Sink or Sinks to complete is
> >> very interesting indeed!
> >>
> >> Curious, if a sink internally makes use of a Global Window with
> >> processing time triggers to push its writes, what mechanism could be used
> >> to release a transform waiting for a signal from the Sink(s) that all
> >> processing is done and it can move forward?
> >>
> >> On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw 
> wrote:
> >>>
> >>> Regarding Python, yes and no. Python doesn't distinguish at compile
> >>> time between (1), (2), and (6), but that doesn't mean it isn't part of
> >>> the public API and people might start counting on it, so it's in some
> >>> sense worse. We can also do (3) (which is less cumbersome in Python,
> >>> either returning a tuple or a dict) or (4).
> >>>
> >>> Good point about providing a simple solution (something that can be
> >>> waited on at least) and allowing for with* modifiers to return more.
> >>>
> >>> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>> >
> >>> > BTW regarding Python SDK, I think the answer to this question is
> >>> > simpler for Python SDK due to the lack of types. Most examples I know just
> >>> > return a PCollection from the Write transform which may or may not be
> >>> > ignored by users. If the PCollection is used, the user should be aware of
> >>> > the element type of the returned PCollection and should use it accordingly
> >>> > in subsequent transforms.
> >>> >
> >>> > Thanks,
> >>> > Cham
> >>> >
> >>> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath <
> chamik...@google.com> wrote:
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw <
> rober...@google.com> wrote:
> >>> >>>
> >>> >>> Good question.
> >>> >>>
> >>> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> >>> objects (e.g there's nothing to wait on).
> >>> >>>
> >>> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
> >>> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> >>> changing the public API, while giving something to wait on without
> >>> >>> promising anything about its contents.
> >>> >>>
> >>> >>>
> >>> >>> I would probably lean towards (4) for anything that would want to
> >>> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> >>> and view (3) as being a "cheap" but more cumbersome for the user way
> >>> >>> of writing (4). In both cases, more information can be added in a
> >>> >>> forward-compatible way. Technically (4) could extend (3) if one wants
> >>> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
> >>> >>> an aside, it would be interesting if any of the schema work that lets
> >>> >>> us get rid of tuple tags for elements (e.g. join operations) 

Re: Return types of Write transforms (aka best way to signal)

2019-07-15 Thread Ismaël Mejía
Just wanted to bring back the conversation on this subject. A quick
abstract of the discussion so far:

We are trying to agree on the best approach for return types in Write
transforms towards some sort of ‘homogenization’ in IOs.
At the moment we mostly agree that the best approaches for return types
in Writes are:

4. Write returns ‘a class that implements POutput’
6. Write returns `PCollection<SourceSpecificWriteResult>`

There are still some details to discuss:

1. Is (4) somehow less composable than (6) ?
2. Could it make sense (in the API sense) that `SourceSpecificWriteResult`
contains a PCollection too as an attribute to cover the ‘tuple’ return
type issue?

In his last email Reuven mentioned some extra points that could change
the direction towards one of the options. Does anyone else have comments /
more ideas we may be missing?

On Thu, Jun 27, 2019 at 6:32 PM Reuven Lax wrote:
>
> This is a good question, because many sinks are logically _not_ windowed. 
> They aren't producing aggregations, so logically they are often treated as if 
> they are in the global window (and many internally window into the global
> window first thing).
>
> Wait is a nice transform that reuses existing windowing, but I wonder if 
> there's another way to model this without relying on windowing. Essentially 
> you want a way to track element provenance - when all results from a single
> element are flushed through another transform, then trigger a second 
> transform. Element provenance is interesting in many other use cases as well 
> (e.g. debugging: given an output element, what input elements caused it?). 
> Maybe there's a more direct way to model this problem without trying to use 
> windowing to track causality?
>
> Reuven
>
> On Thu, Jun 27, 2019 at 4:38 AM Reza Rokni  wrote:
>>
>> The use case of a transform waiting for a Sink or Sinks to complete is very
>> interesting indeed!
>>
>> Curious, if a sink internally makes use of a Global Window with processing 
>> time triggers to push its writes, what mechanism could be used to release a 
>> transform waiting for a signal from the Sink(s) that all processing is done 
>> and it can move forward?
>>
>> On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw  wrote:
>>>
>>> Regarding Python, yes and no. Python doesn't distinguish at compile
>>> time between (1), (2), and (6), but that doesn't mean it isn't part of
>>> the public API and people might start counting on it, so it's in some
>>> sense worse. We can also do (3) (which is less cumbersome in Python,
>>> either returning a tuple or a dict) or (4).
>>>
>>> Good point about providing a simple solution (something that can be
>>> waited on at least) and allowing for with* modifiers to return more.
>>>
>>> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath  
>>> wrote:
>>> >
>>> > BTW regarding Python SDK, I think the answer to this question is simpler 
>>> > for Python SDK due to the lack of types. Most examples I know just return 
>>> > a PCollection from the Write transform which may or may not be ignored by 
>>> > users. If the PCollection is used, the user should be aware of the 
>>> > element type of the returned PCollection and should use it accordingly in 
>>> > subsequent transforms.
>>> >
>>> > Thanks,
>>> > Cham
>>> >
>>> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath  
>>> > wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw  
>>> >> wrote:
>>> >>>
>>> >>> Good question.
>>> >>>
>>> >>> I'm not sure what could be done with (5) if it contains no deferred
>>> >>> objects (e.g there's nothing to wait on).
>>> >>>
>>> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
>>> >>> advantage of (2) is that one can migrate to (1) or (6) without
>>> >>> changing the public API, while giving something to wait on without
>>> >>> promising anything about its contents.
>>> >>>
>>> >>>
>>> >>> I would probably lean towards (4) for anything that would want to
>>> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
>>> >>> and view (3) as being a "cheap" but more cumbersome for the user way
>>> >>> of writing (4). In both cases, more information can be added in a
>>> >>> forward-compatible way. Technically (4) could extend (3) if one wants
>>> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
>>> >>> an aside, it would be interesting if any of the schema work that lets
>>> >>> us get rid of tuple tags for elements (e.g. join operations) could let
>>> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
>>> >>> with PCollection members be as powerful as a PCollectionTuple).)
>>> >>>
>>> >>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía  wrote:
>>> >>> >
>>> >>> > Beam introduced in version 2.4.0 the Wait transform to delay
>>> >>> > processing of each window in a PCollection until signaled. This opened
>>> >>> > new interesting patterns for example writing to a database and when
>>> >>> > ‘fully’ done write to another database.
>>> >>> >
>>> >>> > To support this 

Re: Return types of Write transforms (aka best way to signal)

2019-06-27 Thread Ismaël Mejía
Cham has a point that we can change writes in a ‘backwards’ compatible
way if needed by providing a new Write transform. Of course, ideally we
would not need to do this, to ease maintainability, but it is a good
point against (2) and (3). (1) is a specific case of (2), so it would
probably be covered by this argument.

After reading Robert’s analysis I am also leaning towards (4) because
it gives a clear contract for users, but (6) definitely has the
explicit PCollection composability point. I think we can ignore (5)
since this is a specific case of (4).

So this leaves us with these two options:

4. Write returns ‘a class that implements POutput’ (and may wrap
different PCollections)
vs
6. Write returns a PCollection

I have two questions that may help us decide on the favorite approach:

1. Is (4) somehow less composable than (6) ?
2. Could it make sense (in the API sense) that `SourceSpecificWrite`
contains a PCollection too as an attribute to cover the ‘tuple’ return
type issue?
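
To make question 1 a bit more concrete, here is a minimal plain-Python
sketch of the two shapes. It is only a toy model, not Beam code: lists
stand in for PCollections, and `RowResult`, `ToyWriteResult`,
`write_as_wrapper`, `write_as_collection` and `count` are hypothetical
names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowResult:
    """Hypothetical per-row outcome, standing in for a source-specific result."""
    row: str
    ok: bool


@dataclass(frozen=True)
class ToyWriteResult:
    """Shape (4): a wrapper object exposing several collections by name."""
    successful: List[str]
    failed: List[str]


def write_as_wrapper(rows: List[str]) -> ToyWriteResult:
    # Assumption for the sketch: empty rows fail, everything else succeeds.
    return ToyWriteResult(
        successful=[r for r in rows if r],
        failed=[r for r in rows if not r],
    )


def write_as_collection(rows: List[str]) -> List[RowResult]:
    # Shape (6): one collection whose elements describe each outcome.
    return [RowResult(row=r, ok=bool(r)) for r in rows]


def count(collection: list) -> int:
    # A generic downstream step that accepts any collection.
    return len(collection)


rows = ["a", "", "b"]
wrapper = write_as_wrapper(rows)
results = write_as_collection(rows)

# (6) feeds straight into the generic step; (4) needs a member picked first.
total_from_collection = count(results)
total_from_wrapper = count(wrapper.successful) + count(wrapper.failed)
failed_from_collection = [r.row for r in results if not r.ok]
```

In this model both shapes carry the same information. The difference is
that (6) can be handed directly to any step that takes a collection, while
(4) gives named members at the cost of one extra access.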

On Wed, Jun 26, 2019 at 9:58 PM Robert Bradshaw wrote:
>
> Regarding Python, yes and no. Python doesn't distinguish at compile
> time between (1), (2), and (6), but that doesn't mean it isn't part of
> the public API and people might start counting on it, so it's in some
> sense worse. We can also do (3) (which is less cumbersome in Python,
> either returning a tuple or a dict) or (4).
>
> Good point about providing a simple solution (something that can be
> waited on at least) and allowing for with* modifiers to return more.
>
> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath  
> wrote:
> >
> > BTW regarding Python SDK, I think the answer to this question is simpler 
> > for Python SDK due to the lack of types. Most examples I know just return a 
> > PCollection from the Write transform which may or may not be ignored by 
> > users. If the PCollection is used, the user should be aware of the element 
> > type of the returned PCollection and should use it accordingly in 
> > subsequent transforms.
> >
> > Thanks,
> > Cham
> >
> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath  
> > wrote:
> >>
> >>
> >>
> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw  
> >> wrote:
> >>>
> >>> Good question.
> >>>
> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> objects (e.g there's nothing to wait on).
> >>>
> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> changing the public API, while giving something to wait on without
> >>> promising anything about its contents.
> >>>
> >>>
> >>> I would probably lean towards (4) for anything that would want to
> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> and view (3) as being a "cheap" but more cumbersome for the user way
> >>> of writing (4). In both cases, more information can be added in a
> >>> forward-compatible way. Technically (4) could extend (3) if one wants
> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
> >>> an aside, it would be interesting if any of the schema work that lets
> >>> us get rid of tuple tags for elements (e.g. join operations) could let
> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
> >>> with PCollection members be as powerful as a PCollectionTuple).)
> >>>
> >>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía  wrote:
> >>> >
> >>> > Beam introduced in version 2.4.0 the Wait transform to delay
> >>> > processing of each window in a PCollection until signaled. This opened
> >>> > new interesting patterns for example writing to a database and when
> >>> > ‘fully’ done write to another database.
> >>> >
> >>> > To support this pattern an IO connector Write transform must return a
> >>> > type different from PDone to signal the processing of the next step.
> >>> > Some IOs have already started to implement this return type, but each
> >>> > returned type has different pros and cons so I wanted to open the
> >>> > discussion on this to see if we could somehow find a common pattern to
> >>> > suggest IO authors to follow (Note: It may be the case that there is
> >>> > not a pattern that fits certain use cases).
> >>> >
> >>> > So far the approaches in our code base are:
> >>> >
> >>> > 1. Write returns ‘PCollection<Void>’
> >>> >
> >>> > This is the simplest case but if subsequent transforms require more
> >>> > data that could have been produced during the write it gets ‘lost’.
> >>> > Used by JdbcIO and DynamoDBIO.
> >>> >
> >>> > 2. Write returns ‘PCollection<?>’
> >>> >
> >>> > We can return whatever we want but the return type is uncertain for
> >>> > the user in case he wants to use information from it. This is less
> >>> > user friendly but has the maintenance advantage of not changing
> >>> > signatures if we want to change the return type in the future. Used by
> >>> > RabbitMQIO.
> >>> >
> >>> > 3. Write returns a `PCollectionTuple`
> >>> >
> >>> > It is like (2) but 

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Reza Rokni
The use case of a transform waiting for a Sink or Sinks to complete is
very interesting indeed!

Curious, if a sink internally makes use of a Global Window with processing
time triggers to push its writes, what mechanism could be used to release a
transform waiting for a signal from the Sink(s) that all processing is done
and it can move forward?

On Thu, 27 Jun 2019 at 03:58, Robert Bradshaw wrote:

> Regarding Python, yes and no. Python doesn't distinguish at compile
> time between (1), (2), and (6), but that doesn't mean it isn't part of
> the public API and people might start counting on it, so it's in some
> sense worse. We can also do (3) (which is less cumbersome in Python,
> either returning a tuple or a dict) or (4).
>
> Good point about providing a simple solution (something that can be
> waited on at least) and allowing for with* modifiers to return more.
>
> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath 
> wrote:
> >
> > BTW regarding Python SDK, I think the answer to this question is simpler
> > for Python SDK due to the lack of types. Most examples I know just return a
> > PCollection from the Write transform which may or may not be ignored by
> > users. If the PCollection is used, the user should be aware of the element
> > type of the returned PCollection and should use it accordingly in
> > subsequent transforms.
> >
> > Thanks,
> > Cham
> >
> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath 
> wrote:
> >>
> >>
> >>
> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw 
> wrote:
> >>>
> >>> Good question.
> >>>
> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> objects (e.g there's nothing to wait on).
> >>>
> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> changing the public API, while giving something to wait on without
> >>> promising anything about its contents.
> >>>
> >>>
> >>> I would probably lean towards (4) for anything that would want to
> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> and view (3) as being a "cheap" but more cumbersome for the user way
> >>> of writing (4). In both cases, more information can be added in a
> >>> forward-compatible way. Technically (4) could extend (3) if one wants
> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
> >>> an aside, it would be interesting if any of the schema work that lets
> >>> us get rid of tuple tags for elements (e.g. join operations) could let
> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
> >>> with PCollection members be as powerful as a PCollectionTuple).)
> >>>
> >>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía 
> wrote:
> >>> >
> >>> > Beam introduced in version 2.4.0 the Wait transform to delay
> >>> > processing of each window in a PCollection until signaled. This opened
> >>> > new interesting patterns for example writing to a database and when
> >>> > ‘fully’ done write to another database.
> >>> >
> >>> > To support this pattern an IO connector Write transform must return a
> >>> > type different from PDone to signal the processing of the next step.
> >>> > Some IOs have already started to implement this return type, but each
> >>> > returned type has different pros and cons so I wanted to open the
> >>> > discussion on this to see if we could somehow find a common pattern to
> >>> > suggest IO authors to follow (Note: It may be the case that there is
> >>> > not a pattern that fits certain use cases).
> >>> >
> >>> > So far the approaches in our code base are:
> >>> >
> >>> > 1. Write returns ‘PCollection<Void>’
> >>> >
> >>> > This is the simplest case but if subsequent transforms require more
> >>> > data that could have been produced during the write it gets ‘lost’.
> >>> > Used by JdbcIO and DynamoDBIO.
> >>> >
> >>> > 2. Write returns ‘PCollection<?>’
> >>> >
> >>> > We can return whatever we want but the return type is uncertain for
> >>> > the user in case he wants to use information from it. This is less
> >>> > user friendly but has the maintenance advantage of not changing
> >>> > signatures if we want to change the return type in the future. Used by
> >>> > RabbitMQIO.
> >>> >
> >>> > 3. Write returns a `PCollectionTuple`
> >>> >
> >>> > It is like (2) but with the advantage of returning an untyped tuple of
> >>> > PCollections so we can return more things. Used by SnsIO.
> >>> >
> >>> > 4. Write returns ‘a class that implements POutput’
> >>> >
> >>> > This class wraps the PCollections that were part of the
> >>> > write, e.g. SpannerWriteResult. This is useful because we can be
> >>> > interested in saving inside it a PCollection of failed mutations apart
> >>> > from the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generic case
> >>> > of this one is used by FileIO for Destinations via:
> >>> > ‘WriteFilesResult<DestinationT>’.
> >>> >
> >>> > 5. Write returns ‘a class that implements POutput’ with specific data
> >>> 

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Robert Bradshaw
Regarding Python, yes and no. Python doesn't distinguish at compile
time between (1), (2), and (6), but that doesn't mean it isn't part of
the public API and people might start counting on it, so it's in some
sense worse. We can also do (3) (which is less cumbersome in Python,
either returning a tuple or a dict) or (4).

Good point about providing a simple solution (something that can be
waited on at least) and allowing for with* modifiers to return more.
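
As a rough illustration of the Python side of this, here is a minimal
plain-Python sketch of the tuple and dict shapes. It is a toy model with
lists standing in for PCollections; all function names are hypothetical
and none of this is the Beam Python API.

```python
from typing import Dict, List, Tuple


def write_returning_tuple(rows: List[str]) -> Tuple[List[str], List[str]]:
    # Positional outputs: callers will unpack exactly two values.
    return [r for r in rows if r], [r for r in rows if not r]


def write_returning_dict(rows: List[str]) -> Dict[str, List[str]]:
    # Named outputs: callers look up only the keys they need.
    return {
        "written": [r for r in rows if r],
        "failed": [r for r in rows if not r],
    }


def write_returning_dict_v2(rows: List[str]) -> Dict[str, List[str]]:
    # A later version adds an output without disturbing existing keys.
    outputs = write_returning_dict(rows)
    outputs["skipped"] = []
    return outputs


rows = ["a", "", "b"]
written, failed = write_returning_tuple(rows)
failed_v1 = write_returning_dict(rows)["failed"]
failed_v2 = write_returning_dict_v2(rows)["failed"]
```

Nothing checks these shapes before the code runs, yet they are still a
contract: a caller that unpacks two values from the tuple would break if a
third were added, while the keyed lookup keeps working when the dict grows.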

On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath wrote:
>
> BTW regarding Python SDK, I think the answer to this question is simpler for 
> Python SDK due to the lack of types. Most examples I know just return a 
> PCollection from the Write transform which may or may not be ignored by 
> users. If the PCollection is used, the user should be aware of the element 
> type of the returned PCollection and should use it accordingly in subsequent 
> transforms.
>
> Thanks,
> Cham
>
> On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath  
> wrote:
>>
>>
>>
>> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw  wrote:
>>>
>>> Good question.
>>>
>>> I'm not sure what could be done with (5) if it contains no deferred
>>> objects (e.g there's nothing to wait on).
>>>
>>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
>>> advantage of (2) is that one can migrate to (1) or (6) without
>>> changing the public API, while giving something to wait on without
>>> promising anything about its contents.
>>>
>>>
>>> I would probably lean towards (4) for anything that would want to
>>> return multiple signals/outputs (e.g. successful vs. failed writes)
>>> and view (3) as being a "cheap" but more cumbersome for the user way
>>> of writing (4). In both cases, more information can be added in a
>>> forward-compatible way. Technically (4) could extend (3) if one wants
>>> to migrate from (3) to (4) to provide a nicer API in the future. (As
>>> an aside, it would be interesting if any of the schema work that lets
>>> us get rid of tuple tags for elements (e.g. join operations) could let
>>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
>>> with PCollection members be as powerful as a PCollectionTuple).)
>>>
>>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía  wrote:
>>> >
>>> > Beam introduced in version 2.4.0 the Wait transform to delay
>>> > processing of each window in a PCollection until signaled. This opened
>>> > new interesting patterns for example writing to a database and when
>>> > ‘fully’ done write to another database.
>>> >
>>> > To support this pattern an IO connector Write transform must return a
>>> > type different from PDone to signal the processing of the next step.
>>> > Some IOs have already started to implement this return type, but each
>>> > returned type has different pros and cons so I wanted to open the
>>> > discussion on this to see if we could somehow find a common pattern to
>>> > suggest IO authors to follow (Note: It may be the case that there is
>>> > not a pattern that fits certain use cases).
>>> >
>>> > So far the approaches in our code base are:
>>> >
>>> > 1. Write returns ‘PCollection<Void>’
>>> >
>>> > This is the simplest case but if subsequent transforms require more
>>> > data that could have been produced during the write it gets ‘lost’.
>>> > Used by JdbcIO and DynamoDBIO.
>>> >
>>> > 2. Write returns ‘PCollection<?>’
>>> >
>>> > We can return whatever we want but the return type is uncertain for
>>> > the user in case he wants to use information from it. This is less
>>> > user friendly but has the maintenance advantage of not changing
>>> > signatures if we want to change the return type in the future. Used by
>>> > RabbitMQIO.
>>> >
>>> > 3. Write returns a `PCollectionTuple`
>>> >
>>> > It is like (2) but with the advantage of returning an untyped tuple of
>>> > PCollections so we can return more things. Used by SnsIO.
>>> >
>>> > 4. Write returns ‘a class that implements POutput’
>>> >
>>> > This class wraps the PCollections that were part of the
>>> > write, e.g. SpannerWriteResult. This is useful because we can be
>>> > interested in saving inside it a PCollection of failed mutations apart
>>> > from the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generic case
>>> > of this one is used by FileIO for Destinations via:
>>> > ‘WriteFilesResult<DestinationT>’.
>>> >
>>> > 5. Write returns ‘a class that implements POutput’ with specific data
>>> > (no PCollections)
>>> >
>>> > This is similar to (4) but with the difference that the returned type
>>> > contains the specific data that may be needed next, for example not a
>>> > PCollection but values like the number of rows written. Used by
>>> > BigtableIO (PR in review at the moment). (This can be seen as a
>>> > simpler version of 4).
>>
>>
>> Thanks Ismaël for detailing various approaches with examples.
>>
>> I think current PR for BigTable returns a PCollection  
>> from a PTransform 'WithWriteResults' that can be optionally invoked through 
>> a 

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Chamikara Jayalath
BTW regarding Python SDK, I think the answer to this question is simpler
for Python SDK due to the lack of types. Most examples I know just return a
PCollection from the Write transform which may or may not be ignored by
users. If the PCollection is used, the user should be aware of the element
type of the returned PCollection and should use it accordingly in
subsequent transforms.

Thanks,
Cham

On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath 
wrote:

>
>
> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw 
> wrote:
>
>> Good question.
>>
>> I'm not sure what could be done with (5) if it contains no deferred
>> objects (e.g there's nothing to wait on).
>>
>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
>> advantage of (2) is that one can migrate to (1) or (6) without
>> changing the public API, while giving something to wait on without
>> promising anything about its contents.
>
>
>> I would probably lean towards (4) for anything that would want to
>> return multiple signals/outputs (e.g. successful vs. failed writes)
>> and view (3) as being a "cheap" but more cumbersome for the user way
>> of writing (4). In both cases, more information can be added in a
>> forward-compatible way. Technically (4) could extend (3) if one wants
>> to migrate from (3) to (4) to provide a nicer API in the future. (As
>> an aside, it would be interesting if any of the schema work that lets
>> us get rid of tuple tags for elements (e.g. join operations) could let
>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
>> with PCollection members be as powerful as a PCollectionTuple).
>>
>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía  wrote:
>> >
>> > Beam introduced in version 2.4.0 the Wait transform to delay
>> > processing of each window in a PCollection until signaled. This opened
>> > new interesting patterns for example writing to a database and when
>> > ‘fully’ done write to another database.
>> >
>> > To support this pattern an IO connector Write transform must return a
>> > type different from PDone to signal the processing of the next step.
>> > Some IOs have already started to implement this return type, but each
>> > returned type has different pros and cons so I wanted to open the
>> > discussion on this to see if we could somehow find a common pattern to
>> > suggest IO authors to follow (Note: It may be the case that there is
>> > not a pattern that fits certain use cases).
>> >
>> > So far the approaches in our code base are:
>> >
>> > 1. Write returns ‘PCollection’
>> >
>> > This is the simplest case but if subsequent transforms require more
>> > data that could have been produced during the write it gets ‘lost’.
>> > Used by JdbcIO and DynamoDBIO.
>> >
>> > 2. Write returns ‘PCollection’
>> >
>> > We can return whatever we want but the return type is uncertain for
>> > the user in case he wants to use information from it. This is less
>> > user friendly but has the maintenance advantage of not changing
>> > signatures if we want to change the return type in the future. Used by
>> > RabbitMQIO.
>> >
>> > 3. Write returns a `PCollectionTuple`
>> >
>> > It is like (2) but with the advantage of returning an untyped tuple of
>> > PCollections so we can return more things. Used by SnsIO.
>> >
>> > 4. Write returns ‘a class that implements POutput’
>> >
>> > This class wraps inside of the PCollections that were part of the
>> > write, e.g. SpannerWriteResult. This is useful because we can be
>> > interested on saving inside a PCollection of failed mutations apart of
>> > the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
>> > of this one is used by FileIO for Destinations via:
>> > ‘WriteFilesResult’.
>> >
>> > 5. Write returns ‘a class that implements POutput’ with specific data
>> > (no PCollections)
>> >
>> > This is similar to (4) but with the difference that the returned type
>> > contains the specific data that may be needed next, for example not a
>> > PCollection but values like the number of rows written. Used by
>> > BigtableIO (PR in review at the moment). (This can be seen as a
>> > simpler version of 4).
>>
>
> Thanks Ismaël for detailing various approaches with examples.
>
> I think current PR for BigTable returns a
> PCollection  from a PTransform 'WithWriteResults' that
> can be optionally invoked through a BigTableIO.Write.withWriteResults(). So
> this is more closer to (6) Robert mentioned. But (1) was also discussed as
> an option. PR is https://github.com/apache/beam/pull/7805 for anybody
> interested.
>
> I think (6) is less cumbersome to implement/use and allows us to easily
> extend the transform through more chaining or by changing the return
> transform through additional "with*" methods to the FooIO.Write class.
>
> Thanks,
> Cham
>
> >
>> > I would like to have your opinions on which approach you think it is
>> > better or worse and arguments if you see other
>> > advantages/disadvantages. I am probably more in the (4) camp but I
>> > feel somehow 

Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Chamikara Jayalath
On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw  wrote:

> Good question.
>
> I'm not sure what could be done with (5) if it contains no deferred
> objects (e.g there's nothing to wait on).
>
> There is also (6) return PCollection. The
> advantage of (2) is that one can migrate to (1) or (6) without
> changing the public API, while giving something to wait on without
> promising anything about its contents.


> I would probably lean towards (4) for anything that would want to
> return multiple signals/outputs (e.g. successful vs. failed writes)
> and view (3) as being a "cheap" but more cumbersome for the user way
> of writing (4). In both cases, more information can be added in a
> forward-compatible way. Technically (4) could extend (3) if one wants
> to migrate from (3) to (4) to provide a nicer API in the future. (As
> an aside, it would be interesting if any of the schema work that lets
> us get rid of tuple tags for elements (e.g. join operations) could let
> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
> with PCollection members be as powerful as a PCollectionTuple).
>
> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía  wrote:
> >
> > Beam introduced in version 2.4.0 the Wait transform to delay
> > processing of each window in a PCollection until signaled. This opened
> > new interesting patterns for example writing to a database and when
> > ‘fully’ done write to another database.
> >
> > To support this pattern an IO connector Write transform must return a
> > type different from PDone to signal the processing of the next step.
> > Some IOs have already started to implement this return type, but each
> > returned type has different pros and cons so I wanted to open the
> > discussion on this to see if we could somehow find a common pattern to
> > suggest IO authors to follow (Note: It may be the case that there is
> > not a pattern that fits certain use cases).
> >
> > So far the approaches in our code base are:
> >
> > 1. Write returns ‘PCollection’
> >
> > This is the simplest case but if subsequent transforms require more
> > data that could have been produced during the write it gets ‘lost’.
> > Used by JdbcIO and DynamoDBIO.
> >
> > 2. Write returns ‘PCollection’
> >
> > We can return whatever we want but the return type is uncertain for
> > the user in case he wants to use information from it. This is less
> > user friendly but has the maintenance advantage of not changing
> > signatures if we want to change the return type in the future. Used by
> > RabbitMQIO.
> >
> > 3. Write returns a `PCollectionTuple`
> >
> > It is like (2) but with the advantage of returning an untyped tuple of
> > PCollections so we can return more things. Used by SnsIO.
> >
> > 4. Write returns ‘a class that implements POutput’
> >
> > This class wraps inside of the PCollections that were part of the
> > write, e.g. SpannerWriteResult. This is useful because we can be
> > interested on saving inside a PCollection of failed mutations apart of
> > the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
> > of this one is used by FileIO for Destinations via:
> > ‘WriteFilesResult’.
> >
> > 5. Write returns ‘a class that implements POutput’ with specific data
> > (no PCollections)
> >
> > This is similar to (4) but with the difference that the returned type
> > contains the specific data that may be needed next, for example not a
> > PCollection but values like the number of rows written. Used by
> > BigtableIO (PR in review at the moment). (This can be seen as a
> > simpler version of 4).
>

Thanks Ismaël for detailing various approaches with examples.

I think the current PR for BigTable returns a PCollection
from a PTransform 'WithWriteResults' that can be optionally invoked through
BigTableIO.Write.withWriteResults(). So this is closer to the (6) that
Robert mentioned. But (1) was also discussed as an option. PR is
https://github.com/apache/beam/pull/7805 for anybody interested.

I think (6) is less cumbersome to implement/use and allows us to easily
extend the transform through more chaining or by changing the returned
transform through additional "with*" methods on the FooIO.Write class.
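
A rough plain-Python analogy of the "with*" idea, not Beam code: the default write keeps its simple return value, and an opt-in method hands back a different transform with a richer return type. All names here (Write, WriteWithResults, RowResult, with_write_results) are hypothetical and only loosely mirror the BigTable PR.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RowResult:
    """Hypothetical per-row result; stands in for a typed write result."""
    row_id: int
    ok: bool


class Write:
    """Default write: performs the side effect and returns nothing."""

    def __init__(self, sink: List[int]) -> None:
        self._sink = sink

    def with_write_results(self) -> "WriteWithResults":
        # Opt-in: returns a different transform with a richer return type,
        # leaving the signature of the default write untouched.
        return WriteWithResults(self._sink)

    def apply(self, rows: List[int]) -> None:
        self._sink.extend(rows)


class WriteWithResults:
    """Same side effect, but returns one typed result per written row."""

    def __init__(self, sink: List[int]) -> None:
        self._sink = sink

    def apply(self, rows: List[int]) -> List[RowResult]:
        self._sink.extend(rows)
        return [RowResult(row_id=r, ok=True) for r in rows]


sink: List[int] = []
Write(sink).apply([1, 2])                               # result ignored
results = Write(sink).with_write_results().apply([3])   # result available
```

Existing callers of the plain write are unaffected, while callers that need something to wait on or inspect opt in explicitly.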

Thanks,
Cham

>
> > I would like to have your opinions on which approach you think it is
> > better or worse and arguments if you see other
> > advantages/disadvantages. I am probably more in the (4) camp but I
> > feel somehow attracted by the flexibility that the lack of strict
> > typing brings in (2, 3) in case of changes to the public IO API (of
> > course this can be contested too).
> >
> > Any other ideas, preferences, issues we may be missing?
>


Re: Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Robert Bradshaw
Good question.

I'm not sure what could be done with (5) if it contains no deferred
objects (e.g. there's nothing to wait on).

There is also (6) return PCollection. The
advantage of (2) is that one can migrate to (1) or (6) without
changing the public API, while giving something to wait on without
promising anything about its contents.

I would probably lean towards (4) for anything that would want to
return multiple signals/outputs (e.g. successful vs. failed writes)
and view (3) as a "cheap" way of writing (4) that is more cumbersome
for the user. In both cases, more information can be added in a
forward-compatible way. Technically (4) could extend (3) if one wants
to migrate from (3) to (4) to provide a nicer API in the future. (As
an aside, it would be interesting if any of the schema work that lets
us get rid of tuple tags for elements (e.g. join operations) could let
us get rid of tuple tags for PCollectionTuples, e.g. letting a POJO
with PCollection members be as powerful as a PCollectionTuple.)
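
To make the (3) vs. (4) comparison concrete, here is a minimal plain-Python sketch, not Beam code. A dict keyed by tags stands in for a PCollectionTuple, and a small wrapper class stands in for a POutput implementation. The tags, the `valid` field, and the class and function names are all assumptions made up for illustration.

```python
from typing import Dict, List

# Option (3) analogue: outputs are looked up by tag, so a mistyped tag or a
# wrong guess about the element type only shows up at run time.
SUCCESS_TAG = "success"
FAILED_TAG = "failed"


def write_untyped(rows: List[dict]) -> Dict[str, List[int]]:
    """Pretend write: rows marked valid=False count as failed writes."""
    ok = [r["id"] for r in rows if r.get("valid", True)]
    bad = [r["id"] for r in rows if not r.get("valid", True)]
    return {SUCCESS_TAG: ok, FAILED_TAG: bad}


# Option (4) analogue: a result class with named accessors. It wraps the
# tagged outputs, which is one way (4) could be layered on top of (3).
class WriteResult:
    def __init__(self, outputs: Dict[str, List[int]]) -> None:
        self._outputs = outputs

    @property
    def successful(self) -> List[int]:
        return self._outputs[SUCCESS_TAG]

    @property
    def failed(self) -> List[int]:
        return self._outputs[FAILED_TAG]


def write_typed(rows: List[dict]) -> WriteResult:
    return WriteResult(write_untyped(rows))


rows = [{"id": 1}, {"id": 2, "valid": False}, {"id": 3}]
untyped = write_untyped(rows)   # caller must know the tags
typed = write_typed(rows)       # caller uses named accessors
```

In both shapes a new output can be added later (a new tag or a new accessor) without breaking existing callers.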

On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía  wrote:
>
> Beam introduced in version 2.4.0 the Wait transform to delay
> processing of each window in a PCollection until signaled. This opened
> new interesting patterns for example writing to a database and when
> ‘fully’ done write to another database.
>
> To support this pattern an IO connector Write transform must return a
> type different from PDone to signal the processing of the next step.
> Some IOs have already started to implement this return type, but each
> returned type has different pros and cons so I wanted to open the
> discussion on this to see if we could somehow find a common pattern to
> suggest IO authors to follow (Note: It may be the case that there is
> not a pattern that fits certain use cases).
>
> So far the approaches in our code base are:
>
> 1. Write returns ‘PCollection’
>
> This is the simplest case but if subsequent transforms require more
> data that could have been produced during the write it gets ‘lost’.
> Used by JdbcIO and DynamoDBIO.
>
> 2. Write returns ‘PCollection’
>
> We can return whatever we want but the return type is uncertain for
> the user in case he wants to use information from it. This is less
> user friendly but has the maintenance advantage of not changing
> signatures if we want to change the return type in the future. Used by
> RabbitMQIO.
>
> 3. Write returns a `PCollectionTuple`
>
> It is like (2) but with the advantage of returning an untyped tuple of
> PCollections so we can return more things. Used by SnsIO.
>
> 4. Write returns ‘a class that implements POutput’
>
> This class wraps inside of the PCollections that were part of the
> write, e.g. SpannerWriteResult. This is useful because we can be
> interested on saving inside a PCollection of failed mutations apart of
> the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
> of this one is used by FileIO for Destinations via:
> ‘WriteFilesResult’.
>
> 5. Write returns ‘a class that implements POutput’ with specific data
> (no PCollections)
>
> This is similar to (4) but with the difference that the returned type
> contains the specific data that may be needed next, for example not a
> PCollection but values like the number of rows written. Used by
> BigtableIO (PR in review at the moment). (This can be seen as a
> simpler version of 4).
>
> I would like to have your opinions on which approach you think it is
> better or worse and arguments if you see other
> advantages/disadvantages. I am probably more in the (4) camp but I
> feel somehow attracted by the flexibility that the lack of strict
> typing brings in (2, 3) in case of changes to the public IO API (of
> course this can be contested too).
>
> Any other ideas, preferences, issues we may be missing?


Return types of Write transforms (aka best way to signal)

2019-06-26 Thread Ismaël Mejía
Beam introduced the Wait transform in version 2.4.0 to delay
processing of each window in a PCollection until signaled. This opened
up interesting new patterns, for example writing to a database and,
when ‘fully’ done, writing to another database.
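
As a rough plain-Python picture of that pattern (a toy model, not the Beam API): each window of the main input is held back until the signal for the same window is complete. The names `wait_on`, `windows`, and `done` are made up for illustration.

```python
from typing import Dict, Iterable, List, Set


def wait_on(main_by_window: Dict[str, List[str]],
            done_windows: Set[str]) -> Dict[str, List[str]]:
    """Release only the windows whose signal has finished.

    Toy stand-in for the idea behind Wait: elements of a window are held
    back until the signal for that same window is complete.
    """
    return {w: elems for w, elems in main_by_window.items()
            if w in done_windows}


def write(sink: List[str], elements: Iterable[str]) -> None:
    """Pretend database write: append the elements to a list."""
    sink.extend(elements)


first_db: List[str] = []
second_db: List[str] = []
windows = {"w1": ["a", "b"], "w2": ["c"]}

# First write: only window w1 has been fully written so far.
write(first_db, windows["w1"])
done = {"w1"}

# The second write only sees windows that the first write has finished.
for elems in wait_on(windows, done).values():
    write(second_db, elems)
```

The point of the sketch is that the second write needs some output from the first one to act as the "done" signal, which is why the return type of the first write matters.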

To support this pattern, an IO connector's Write transform must return
a type other than PDone so that it can signal the next step. Some IOs
have already started to implement such a return type, but each
returned type has different pros and cons, so I wanted to open the
discussion to see if we could find a common pattern to suggest to IO
authors (Note: there may not be a pattern that fits certain use
cases).

So far the approaches in our code base are:

1. Write returns ‘PCollection<Void>’

This is the simplest case but if subsequent transforms require more
data that could have been produced during the write it gets ‘lost’.
Used by JdbcIO and DynamoDBIO.

2. Write returns ‘PCollection<?>’

We can return whatever we want, but the return type is uncertain for
the user if they want to use information from it. This is less
user-friendly but has the maintenance advantage of not changing
signatures if we want to change the return type in the future. Used by
RabbitMQIO.

3. Write returns a `PCollectionTuple`

It is like (2) but with the advantage of returning an untyped tuple of
PCollections so we can return more things. Used by SnsIO.

4. Write returns ‘a class that implements POutput’

This class wraps the PCollections that were part of the write, e.g.
SpannerWriteResult. This is useful because we may want to keep a
PCollection of failed mutations in addition to the ‘done’ signal.
Used by BigQueryIO and SpannerIO. A generic version of this is used by
FileIO for Destinations via ‘WriteFilesResult’.

5. Write returns ‘a class that implements POutput’ with specific data
(no PCollections)

This is similar to (4) but with the difference that the returned type
contains the specific data that may be needed next, for example not a
PCollection but values like the number of rows written. Used by
BigtableIO (PR in review at the moment). (This can be seen as a
simpler version of 4).

I would like to hear your opinions on which approach you think is
better or worse, along with arguments if you see other
advantages/disadvantages. I am probably more in the (4) camp, but I
feel somewhat attracted by the flexibility that the lack of strict
typing brings in (2, 3) in case of changes to the public IO API (of
course this can be contested too).

Any other ideas, preferences, issues we may be missing?