Re: Writing a new IO on beam, should I use the source API or SDF?

Boyuan Zhang Fri, 15 May 2020 16:23:24 -0700

Hi Steve,

Yes that's correct.


On Fri, May 15, 2020 at 2:11 PM Steve Niemitz <[email protected]> wrote:

> ah! ok awesome, I think that was the piece I was misunderstanding.  So I
> _can_ use a SDF to split the work initially (like I was manually doing in
> #1), but it just won't be further split dynamically on dataflow v1 right
> now.  Is my understanding there correct?
>
> On Fri, May 15, 2020 at 5:03 PM Luke Cwik <[email protected]> wrote:
>
>> #3 is the best when you implement @SplitRestriction on the SDF.
>>
>> The size of each restriction is used to better balance the splits within
>> Dataflow runner v2 so it is less susceptible to the too many or unbalanced
>> split problem.
>> For example, if you have 4 workers and make 20 splits, the splits will be
>> grouped based upon their sizes. So if 19 of those splits are small and 1 is
>> big, the 1 will execute by itself while the 19 will be done by the 3 other
>> workers.
>>
>> Also, dynamic work rebalancing isn't meant to replace those initial
>> splits but helps a lot with worker rebalancing since a few workers are
>> usually stragglers and will need some help at the end of a pipeline.
>>
>> On Fri, May 15, 2020 at 1:54 PM Steve Niemitz <[email protected]>
>> wrote:
>>
>>> Thanks for the replies so far.  I should have specifically mentioned
>>> above, I am building a bounded source.
>>>
>>> While I was thinking this through, I realized that I might not actually
>>> need any fancy splitting, since I can calculate all my split points up
>>> front.  I think this goes well with Ismaël's suggestion as well.
>>>
>>> I'm curious what the pros and cons would be of these options:
>>> 1) Presplit each file into N pieces (based on a target bundle size,
>>> similar to how it looks like the avro reader does it), using a
>>> standard DoFn to read each split.
>>> 2) Presplit, but use a SDF to support further splitting once it's
>>> supported in dataflow.  (this would also help if I have files that can't be
>>> split up front)
>>> 3) Don't pre-split, but use a SDF.
>>> 4) Use the source API
>>>
>>> I think we've covered 2 and 4 pretty well already, but curious
>>> specifically about the pre-split approach.  Thanks again so far!
>>>
>>> On Fri, May 15, 2020 at 1:11 PM Ismaël Mejía <[email protected]> wrote:
>>>
>>>> For the Bounded case if you do not have a straight forward way to split
>>>> at
>>>> fractions, or simply if you do not care about Dynamic Work Rebalancing.
>>>> You can
>>>> get away implementing a simple DoFn (without Restrictions) based
>>>> implementation
>>>> and evolve from it. More and more IOs at Beam are becoming DoFn based
>>>> (even if
>>>> not SDF) because you win the composability advantages.
>>>>
>>>> An interesting question is when should we start deprecating the Source
>>>> API and
>>>> encourage people to write only DoFn based IOs. I think we are getting
>>>> to the
>>>> maturity point where we can start this discussion.
>>>>
>>>> On Fri, May 15, 2020 at 4:59 PM Luke Cwik <[email protected]> wrote:
>>>> >
>>>> > If it is an unbounded source then SDF is a winner since you are not
>>>> giving up anything with it when compared to the legacy UnboundedSource API
>>>> since Dataflow doesn't support dynamic splitting of unbounded SDFs or
>>>> UnboundedSources (only initial splitting). You gain the ability to compose
>>>> sources and the initial splitting is done at pipeline execution for SDFs vs
>>>> pipeline construction time for UnboundedSource.
>>>> >
>>>> > If it is bounded, my gut is to still go with SDF since:
>>>> > * Dataflow runner V2 supports SDF fully
>>>> > * The Java/Python SDF APIs have gone through the majority of churn
>>>> already, there are some minor clean-ups and then I would like to remove the
>>>> @Experimental annotations from them after a discussion on dev@ about it
>>>> > * Being able to compose "sources" is immensely powerful
>>>> >
>>>> > The caveat is that Dataflow runner V1 doesn't support dynamic
>>>> splitting of SDFs today and depending on how well runner v2 rollout
>>>> happens, may never. The big plus with the legacy source API is that there
>>>> are already bounded/unbounded source wrappers that will convert them into
>>>> SDFs so you get all of runner v1 and runner v2 support for what the legacy
>>>> source API can do today but give up the composability and any splitting
>>>> support for unbounded SDFs that will come later.
>>>> >
>>>> > Finally, there is a way to get limited support for dynamic splitting
>>>> of bounded and unbounded SDFs for other runners using the composability of
>>>> SDFs and the limited depth splitting proposal[1].
>>>> >
>>>> > 1:
>>>> https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/edit#heading=h.wkwslng744mv
>>>> >
>>>> > On Fri, May 15, 2020 at 7:08 AM Steve Niemitz <[email protected]>
>>>> wrote:
>>>> >>
>>>> >> I'm going to be writing a new IO (in java) for reading files in a
>>>> custom format, and want to make it splittable.  It seems like I have a
>>>> choice between the "legacy" source API, and newer experimental SDF API.  Is
>>>> there any guidance on which I should use?  I can likely tolerate some API
>>>> churn as well in the SDF APIs.
>>>> >>
>>>> >> My target runner is dataflow.
>>>> >>
>>>> >> Thanks!
>>>>
>>>

Re: Writing a new IO on beam, should I use the source API or SDF?

Reply via email to