Thanks for the replies so far. I should have mentioned above that I am building a bounded source.
While I was thinking this through, I realized that I might not actually need any fancy splitting, since I can calculate all my split points up front. I think this goes well with Ismaël's suggestion as well. I'm curious what the pros and cons of these options would be:

1) Pre-split each file into N pieces (based on a target bundle size, similar to how the Avro reader appears to do it), using a standard DoFn to read each split.
2) Pre-split, but use an SDF to support further splitting once it's supported in Dataflow. (This would also help if I have files that can't be split up front.)
3) Don't pre-split, but use an SDF.
4) Use the Source API.

I think we've covered 2 and 4 pretty well already, but I'm curious specifically about the pre-split approach. Thanks again so far!

On Fri, May 15, 2020 at 1:11 PM Ismaël Mejía <[email protected]> wrote:

> For the Bounded case, if you do not have a straightforward way to split at
> fractions, or simply if you do not care about Dynamic Work Rebalancing,
> you can get away with a simple DoFn-based implementation (without
> Restrictions) and evolve from it. More and more IOs in Beam are becoming
> DoFn-based (even if not SDF) because you win the composability advantages.
>
> An interesting question is when we should start deprecating the Source
> API and encourage people to write only DoFn-based IOs. I think we are
> getting to the maturity point where we can start this discussion.
>
> On Fri, May 15, 2020 at 4:59 PM Luke Cwik <[email protected]> wrote:
> >
> > If it is an unbounded source, then SDF is a winner, since you are not
> > giving up anything compared to the legacy UnboundedSource API:
> > Dataflow doesn't support dynamic splitting of unbounded SDFs or
> > UnboundedSources (only initial splitting). You also gain the ability to
> > compose sources, and initial splitting is done at pipeline execution
> > time for SDFs vs. pipeline construction time for UnboundedSource.
> >
> > If it is bounded, my gut is to still go with SDF since:
> > * Dataflow runner V2 supports SDF fully
> > * The Java/Python SDF APIs have gone through the majority of churn
> > already; there are some minor clean-ups, and then I would like to
> > remove the @Experimental annotations from them after a discussion on
> > dev@ about it
> > * Being able to compose "sources" is immensely powerful
> >
> > The caveat is that Dataflow runner V1 doesn't support dynamic splitting
> > of SDFs today and, depending on how well the runner V2 rollout goes,
> > may never. The big plus of the legacy Source API is that there are
> > already bounded/unbounded source wrappers that will convert them into
> > SDFs, so you get all of runner V1 and runner V2 support for what the
> > legacy Source API can do today, but you give up the composability and
> > any splitting support for unbounded SDFs that will come later.
> >
> > Finally, there is a way to get limited support for dynamic splitting of
> > bounded and unbounded SDFs on other runners using the composability of
> > SDFs and the limited depth splitting proposal [1].
> >
> > 1: https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/edit#heading=h.wkwslng744mv
> >
> > On Fri, May 15, 2020 at 7:08 AM Steve Niemitz <[email protected]> wrote:
> >>
> >> I'm going to be writing a new IO (in Java) for reading files in a
> >> custom format, and I want to make it splittable. It seems like I have
> >> a choice between the "legacy" Source API and the newer experimental
> >> SDF API. Is there any guidance on which I should use? I can likely
> >> tolerate some API churn in the SDF APIs as well.
> >>
> >> My target runner is Dataflow.
> >>
> >> Thanks!
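For what it's worth, the pre-split computation in option 1 is just partitioning each file by byte offset against a target bundle size. Here's a minimal standalone sketch of that idea; the `PreSplitter` class and its nested `OffsetRange` are hypothetical illustrations, not Beam's API (Beam does ship an `org.apache.beam.sdk.io.range.OffsetRange` type that could be used in its place):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating option 1: compute byte-range splits for a
// file up front, targeting a desired bundle size. Each range would then be
// emitted as an element and read by a plain DoFn downstream.
public class PreSplitter {

  // Simple half-open [from, to) byte range.
  public static final class OffsetRange {
    public final long from;
    public final long to;

    public OffsetRange(long from, long to) {
      this.from = from;
      this.to = to;
    }
  }

  // Split a file of fileSize bytes into ranges of at most targetBundleSize bytes.
  public static List<OffsetRange> preSplit(long fileSize, long targetBundleSize) {
    List<OffsetRange> ranges = new ArrayList<>();
    for (long start = 0; start < fileSize; start += targetBundleSize) {
      ranges.add(new OffsetRange(start, Math.min(start + targetBundleSize, fileSize)));
    }
    return ranges;
  }

  public static void main(String[] args) {
    // A 250-byte file with a 100-byte target produces three ranges,
    // the last one shorter than the target.
    for (OffsetRange r : preSplit(250, 100)) {
      System.out.println("[" + r.from + ", " + r.to + ")");
    }
  }
}
```

In the actual IO, the reading DoFn would open the file, seek to `from`, and scan forward to the next record boundary before reading, as the custom format requires.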
