Re: Return types of Write transforms (aka best way to signal)

Ismaël Mejía Thu, 27 Jun 2019 09:04:32 -0700

Cham has a point in the fact that we can change writes in a
‘backwards’ compatible way if needed by providing a new Write
transform, of course the ideal is that we do not need to do this to
ease maintainability, but is a good point against (2) and (3). (1) is
a specific case of (2) so probably would be covered by this argument.


After reading Robert’s analysis I am also leaning towards (4) because
it gives a clear contract for users, but (6) definitely has the
explicit PCollection composability point. I think we can ignore (5)
since this is a specific case of (4).

So this let us with these two options:

4. Write returns ‘a class that implements POutput’ (and may wrap
different PCollections)
vs
6. Write returns a PCollection<SourceSpecificWrite>

I have two questions that may let us help to decide on the favorite approach:

1. Is (4) somehow less composable than (6) ?
2. Could it make sense (in the API sense) that `SourceSpecificWrite`
contains a PCollection too as an attribute to cover the ‘tuple’ return
type issue?

On Wed, Jun 26, 2019 at 9:58 PM Robert Bradshaw <rober...@google.com> wrote:
>
> Regarding Python, yes and no. Python doesn't distinguish at compile
> time between (1), (2), and (6), but that doesn't mean it isn't part of
> the public API and people might start counting on it, so it's in some
> sense worse. We can also do (3) (which is less cumbersome in Python,
> either returning a tuple or a dict) or (4).
>
> Good point about providing a simple solution (something that can be
> waited on at least) and allowing for with* modifiers to return more.
>
> On Wed, Jun 26, 2019 at 7:08 PM Chamikara Jayalath <chamik...@google.com> 
> wrote:
> >
> > BTW regarding Python SDK, I think the answer to this question is simpler 
> > for Python SDK due to the lack of types. Most examples I know just return a 
> > PCollection from the Write transform which may or may not be ignored by 
> > users. If the PCollection is used, the user should be aware of the element 
> > type of the returned PCollection and should use it accordingly in 
> > subsequent transforms.
> >
> > Thanks,
> > Cham
> >
> > On Wed, Jun 26, 2019 at 9:57 AM Chamikara Jayalath <chamik...@google.com> 
> > wrote:
> >>
> >>
> >>
> >> On Wed, Jun 26, 2019 at 5:46 AM Robert Bradshaw <rober...@google.com> 
> >> wrote:
> >>>
> >>> Good question.
> >>>
> >>> I'm not sure what could be done with (5) if it contains no deferred
> >>> objects (e.g there's nothing to wait on).
> >>>
> >>> There is also (6) return PCollection<SourceSpecificWriteResult>. The
> >>> advantage of (2) is that one can migrate to (1) or (6) without
> >>> changing the public API, while giving something to wait on without
> >>> promising anything about its contents.
> >>>
> >>>
> >>> I would probably lean towards (4) for anything that would want to
> >>> return multiple signals/outputs (e.g. successful vs. failed writes)
> >>> and view (3) as being a "cheap" but more cumbersome for the user way
> >>> of writing (4). In both cases, more information can be added in a
> >>> forward-compatible way. Technically (4) could extend (3) if one wants
> >>> to migrate from (3) to (4) to provide a nicer API in the future. (As
> >>> an aside, it would be interesting if any of the schema work that lets
> >>> us get rid of tuple tags for elements (e.g. join operations) could let
> >>> us get rid of tuple tags for PCollectionTuples (e.g. letting a POJO
> >>> with PCollection members be as powerful as a PCollectionTuple).
> >>>
> >>> On Wed, Jun 26, 2019 at 2:23 PM Ismaël Mejía <ieme...@gmail.com> wrote:
> >>> >
> >>> > Beam introduced in version 2.4.0 the Wait transform to delay
> >>> > processing of each window in a PCollection until signaled. This opened
> >>> > new interesting patterns for example writing to a database and when
> >>> > ‘fully’ done write to another database.
> >>> >
> >>> > To support this pattern an IO connector Write transform must return a
> >>> > type different from PDone to signal the processing of the next step.
> >>> > Some IOs have already started to implement this return type, but each
> >>> > returned type has different pros and cons so I wanted to open the
> >>> > discussion on this to see if we could somehow find a common pattern to
> >>> > suggest IO authors to follow (Note: It may be the case that there is
> >>> > not a pattern that fits certain use cases).
> >>> >
> >>> > So far the approaches in our code base are:
> >>> >
> >>> > 1. Write returns ‘PCollection<Void>’
> >>> >
> >>> > This is the simplest case but if subsequent transforms require more
> >>> > data that could have been produced during the write it gets ‘lost’.
> >>> > Used by JdbcIO and DynamoDBIO.
> >>> >
> >>> > 2. Write returns ‘PCollection<?>’
> >>> >
> >>> > We can return whatever we want but the return type is uncertain for
> >>> > the user in case he wants to use information from it. This is less
> >>> > user friendly but has the maintenance advantage of not changing
> >>> > signatures if we want to change the return type in the future. Used by
> >>> > RabbitMQIO.
> >>> >
> >>> > 3. Write returns a `PCollectionTuple`
> >>> >
> >>> > It is like (2) but with the advantage of returning an untyped tuple of
> >>> > PCollections so we can return more things. Used by SnsIO.
> >>> >
> >>> > 4. Write returns ‘a class that implements POutput’
> >>> >
> >>> > This class wraps inside of the PCollections that were part of the
> >>> > write, e.g. SpannerWriteResult. This is useful because we can be
> >>> > interested on saving inside a PCollection of failed mutations apart of
> >>> > the ‘done’ signal. Used by BigQueryIO and SpannerIO. A generics case
> >>> > of this one is used by FileIO for Destinations via:
> >>> > ‘WriteFilesResult<DestinationT>’.
> >>> >
> >>> > 5. Write returns ‘a class that implements POutput’ with specific data
> >>> > (no PCollections)
> >>> >
> >>> > This is similar to (4) but with the difference that the returned type
> >>> > contains the specific data that may be needed next, for example not a
> >>> > PCollection but values like the number of rows written. Used by
> >>> > BigtableIO (PR in review at the moment). (This can be seen as a
> >>> > simpler version of 4).
> >>
> >>
> >> Thanks Ismaël for detailing various approaches with examples.
> >>
> >> I think current PR for BigTable returns a PCollection<BigTableWriteResult> 
> >>  from a PTransform 'WithWriteResults' that can be optionally invoked 
> >> through a BigTableIO.Write.withWriteResults(). So this is more closer to 
> >> (6) Robert mentioned. But (1) was also discussed as an option. PR is 
> >> https://github.com/apache/beam/pull/7805 for anybody interested.
> >>
> >> I think (6) is less cumbersome to implement/use and allows us to easily 
> >> extend the transform through more chaining or by changing the return 
> >> transform through additional "with*" methods to the FooIO.Write class.
> >>
> >> Thanks,
> >> Cham
> >>
> >>> >
> >>> > I would like to have your opinions on which approach you think it is
> >>> > better or worse and arguments if you see other
> >>> > advantages/disadvantages. I am probably more in the (4) camp but I
> >>> > feel somehow attracted by the flexibility that the lack of strict
> >>> > typing brings in (2, 3) in case of changes to the public IO API (of
> >>> > course this can be contested too).
> >>> >
> >>> > Any other ideas, preferences, issues we may be missing?

Re: Return types of Write transforms (aka best way to signal)

Reply via email to