Re: Writing a new IO on beam, should I use the source API or SDF?

Ismaël Mejía Fri, 15 May 2020 10:12:29 -0700

For the bounded case, if you do not have a straightforward way to split at
fractions, or if you simply do not care about Dynamic Work Rebalancing, you can
get away with a simple DoFn-based implementation (without Restrictions) and
evolve from it. More and more IOs in Beam are becoming DoFn based (even if
not SDF) because you gain the composability advantages.
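
To make "split at fractions" concrete, here is a rough plain-Java sketch of
the idea for an offset range. It is not Beam's API: the class and method names
are hypothetical, and it assumes any offset is a valid split point, which is
often not true for a custom file format where a split must land on a record
boundary.

```java
// Plain-Java sketch, NOT Beam's API: every name here is hypothetical.
final class OffsetRangeSketch {
    final long from; // inclusive start offset
    final long to;   // exclusive end offset

    OffsetRangeSketch(long from, long to) {
        if (from > to) {
            throw new IllegalArgumentException("from must be <= to");
        }
        this.from = from;
        this.to = to;
    }

    /**
     * Tries to split the unclaimed part [nextOffset, to) so that the primary
     * keeps roughly fractionToKeep of it. Returns {primary, residual}, or
     * null when nothing would be left to hand off.
     */
    OffsetRangeSketch[] splitAtFraction(long nextOffset, double fractionToKeep) {
        if (nextOffset < from || nextOffset > to) {
            throw new IllegalArgumentException("nextOffset outside range");
        }
        if (!(fractionToKeep >= 0.0 && fractionToKeep <= 1.0)) {
            throw new IllegalArgumentException("fraction must be in [0, 1]");
        }
        long remaining = to - nextOffset;
        long splitPos = nextOffset + (long) Math.ceil(fractionToKeep * remaining);
        if (splitPos >= to) {
            return null; // the residual would be empty, so refuse the split
        }
        return new OffsetRangeSketch[] {
            new OffsetRangeSketch(from, splitPos), // work kept by this reader
            new OffsetRangeSketch(splitPos, to)    // work handed back to the runner
        };
    }
}
```

For example, a range [0, 100) whose next unclaimed offset is 20, split with
fractionToKeep = 0.5, yields [0, 60) and [60, 100). If your format cannot
cheaply find a valid record start near an arbitrary offset, that is the case
where a plain DoFn is the simpler starting point.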


An interesting question is when we should start deprecating the Source API and
encouraging people to write only DoFn-based IOs. I think we are reaching the
maturity point where we can start this discussion.

On Fri, May 15, 2020 at 4:59 PM Luke Cwik <[email protected]> wrote:
>
> If it is an unbounded source then SDF is a winner since you are not giving up 
> anything with it when compared to the legacy UnboundedSource API since 
> Dataflow doesn't support dynamic splitting of unbounded SDFs or 
> UnboundedSources (only initial splitting). You gain the ability to compose 
> sources and the initial splitting is done at pipeline execution for SDFs vs 
> pipeline construction time for UnboundedSource.
>
> If it is bounded, my gut is to still go with SDF since:
> * Dataflow runner V2 supports SDF fully
> * The Java/Python SDF APIs have gone through the majority of churn already, 
> there are some minor clean-ups and then I would like to remove the 
> @Experimental annotations from them after a discussion on dev@ about it
> * Being able to compose "sources" is immensely powerful
>
> The caveat is that Dataflow runner V1 doesn't support dynamic splitting of 
> SDFs today and depending on how well runner v2 rollout happens, may never. 
> The big plus with the legacy source API is that there are already 
> bounded/unbounded source wrappers that will convert them into SDFs so you get 
> all of runner v1 and runner v2 support for what the legacy source API can do 
> today but give up the composability and any splitting support for unbounded 
> SDFs that will come later.
>
> Finally, there is a way to get limited support for dynamic splitting of 
> bounded and unbounded SDFs for other runners using the composability of SDFs 
> and the limited depth splitting proposal[1].
>
> 1: 
> https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/edit#heading=h.wkwslng744mv
>
> On Fri, May 15, 2020 at 7:08 AM Steve Niemitz <[email protected]> wrote:
>>
>> I'm going to be writing a new IO (in java) for reading files in a custom 
>> format, and want to make it splittable.  It seems like I have a choice 
>> between the "legacy" source API, and newer experimental SDF API.  Is there 
>> any guidance on which I should use?  I can likely tolerate some API churn as 
>> well in the SDF APIs.
>>
>> My target runner is dataflow.
>>
>> Thanks!
