If it is an unbounded source, then SDF is the clear winner: you give up
nothing compared to the legacy UnboundedSource API, since Dataflow doesn't
support dynamic splitting of unbounded SDFs or UnboundedSources (only
initial splitting). You gain the ability to compose sources, and initial
splitting happens at pipeline execution time for SDFs vs pipeline
construction time for UnboundedSource.

If it is bounded, my gut is to still go with SDF since:
* Dataflow runner v2 fully supports SDFs
* The Java/Python SDF APIs have already gone through the majority of their
churn; some minor clean-ups remain, and then I would like to remove the
@Experimental annotations from them after a discussion on dev@
* Being able to compose "sources" is immensely powerful (see the sketch
after this list)
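For the file-reading case you describe, a bounded SDF is just a DoFn whose
restriction is an OffsetRange over the file's bytes, and which only emits a
record after claiming that record's offset from the tracker. Here is a rough
sketch of what I mean (MyRecord, CustomFormatReader, and its seek/offset
methods are placeholders for your format, not existing Beam APIs):

import java.io.IOException;
import java.nio.channels.SeekableByteChannel;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

@DoFn.BoundedPerElement
class ReadCustomFormatFn extends DoFn<FileIO.ReadableFile, MyRecord> {

  // The initial restriction covers the whole file. OffsetRange has a default
  // tracker, so no @NewTracker method is needed.
  @GetInitialRestriction
  public OffsetRange getInitialRestriction(@Element FileIO.ReadableFile file) {
    return new OffsetRange(0, file.getMetadata().sizeBytes());
  }

  // Initial splitting happens at execution time: propose sub-ranges the
  // runner can distribute before processing starts (64MB is an arbitrary pick).
  @SplitRestriction
  public void splitRestriction(
      @Restriction OffsetRange range, OutputReceiver<OffsetRange> out) {
    for (OffsetRange part : range.split(64 * 1024 * 1024, 1024 * 1024)) {
      out.output(part);
    }
  }

  @ProcessElement
  public void processElement(
      @Element FileIO.ReadableFile file,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<MyRecord> out)
      throws IOException {
    try (SeekableByteChannel channel = file.openSeekable()) {
      CustomFormatReader reader = new CustomFormatReader(channel);
      // Start at the first record boundary at or after the restriction start.
      reader.seekToRecordAtOrAfter(tracker.currentRestriction().getFrom());
      while (reader.hasNext()) {
        // Only emit a record whose starting offset we successfully claim;
        // tryClaim returning false means this range was split or exhausted.
        if (!tracker.tryClaim(reader.currentRecordOffset())) {
          return;
        }
        out.output(reader.next());
      }
    }
  }
}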

The caveat is that Dataflow runner v1 doesn't support dynamic splitting of
SDFs today and, depending on how the runner v2 rollout goes, may never.
The big plus for the legacy source API is that there are already
bounded/unbounded source wrappers that convert those sources into SDFs, so
you get all of runner v1 and runner v2 support for what the legacy source
API can do today, but you give up the composability and any splitting
support for unbounded SDFs that will come later.
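To make the composability point concrete: the SDF sketched above is just a
ParDo, so it composes with whatever produces its input. For example (the
bucket path is made up; FileIO.match and FileIO.readMatches are existing
Beam transforms):

PCollection<MyRecord> records =
    pipeline
        .apply(FileIO.match().filepattern("gs://my-bucket/input/*.custom"))
        .apply(FileIO.readMatches())
        .apply(ParDo.of(new ReadCustomFormatFn()));

You could swap the match step for anything else that yields readable files,
e.g. FileIO.match().continuously(...) to turn the same read into a streaming
one, which a legacy BoundedSource/UnboundedSource root transform can't do.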

Finally, there is a way to get limited support for dynamic splitting of
bounded and unbounded SDFs on other runners by using the composability of
SDFs and the limited depth splitting proposal [1].

1:
https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/edit#heading=h.wkwslng744mv

On Fri, May 15, 2020 at 7:08 AM Steve Niemitz <[email protected]> wrote:

> I'm going to be writing a new IO (in java) for reading files in a custom
> format, and want to make it splittable.  It seems like I have a choice
> between the "legacy" source API, and newer experimental SDF API.  Is there
> any guidance on which I should use?  I can likely tolerate some API churn
> as well in the SDF APIs.
>
> My target runner is dataflow.
>
> Thanks!
>
