Re: splitIntoBundles vs. generateInitialSplits

Stas Levin Wed, 11 Jan 2017 07:27:18 -0800

Eugene, that makes a lot of sense to me.

Do you think it's worth filing a Jira ticket?

On Wed, Jan 11, 2017 at 2:58 AM Eugene Kirpichov
<kirpic...@google.com.invalid> wrote:

I agree that the methods are named somewhat confusingly, and ideally would
be named the same. Both of the names miss some aspect of the underlying
concept.

The underlying concept is to split the source into smaller sub-sources which,
if you read all of them, would yield the same data as the original one.
"splitIntoBundles" assumes that 1 source = 1 bundle, which is completely
false in streaming, and only partially true in batch (I'm talking about the
Dataflow runner).
"generateInitialSplits" assumes that this splitting happens only
"initially", i.e. at job startup time. This is currently true in practice
for all existing runners, but it doesn't have to be - we could conceivably
call it again at some point during the job if we see that some of the
sub-sources are still too large.

The analogous method in Splittable DoFn (
https://s.apache.org/splittable-do-fn) is called @SplitRestriction, but
there are no restrictions in the source API, only sources.

Perhaps both should be called simply "split", or "splitIntoSubSources".
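
To make the concept above concrete, here is a minimal sketch in Java of a
source split into sub-sources that, read together, yield the same data as
the original. It deliberately does not use the Beam API: RangeSource, split,
read and readAll are hypothetical names made up for this illustration, and
the "data" is just a range of integers.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** Hypothetical toy source that "reads" the integers in [start, end). Not a Beam class. */
    static final class RangeSource {
        final long start;
        final long end;

        RangeSource(long start, long end) {
            if (start > end) {
                throw new IllegalArgumentException("start must not exceed end");
            }
            this.start = start;
            this.end = end;
        }

        /** Splits this source into sub-sources covering at most desiredSize elements each. */
        List<RangeSource> split(long desiredSize) {
            if (desiredSize <= 0) {
                throw new IllegalArgumentException("desiredSize must be positive");
            }
            List<RangeSource> subSources = new ArrayList<>();
            for (long s = start; s < end; s += desiredSize) {
                subSources.add(new RangeSource(s, Math.min(s + desiredSize, end)));
            }
            return subSources;
        }

        /** Reads every element of this source. */
        List<Long> read() {
            List<Long> out = new ArrayList<>();
            for (long i = start; i < end; i++) {
                out.add(i);
            }
            return out;
        }
    }

    /** Reads all the given sources in order and concatenates the results. */
    static List<Long> readAll(List<RangeSource> sources) {
        List<Long> out = new ArrayList<>();
        for (RangeSource source : sources) {
            out.addAll(source.read());
        }
        return out;
    }

    public static void main(String[] args) {
        RangeSource original = new RangeSource(0, 10);

        // First split, e.g. at job startup: [0,4), [4,8), [8,10).
        List<RangeSource> subSources = original.split(4);
        System.out.println(subSources.size()); // prints 3

        // Nothing stops a caller from splitting a sub-source again later
        // if it turns out to be too large: [0,4) becomes [0,2), [2,4).
        List<RangeSource> refined = new ArrayList<>(subSources.get(0).split(2));
        refined.addAll(subSources.subList(1, subSources.size()));

        // The invariant: reading all sub-sources equals reading the original.
        System.out.println(readAll(refined).equals(original.read())); // prints true
    }
}
```

Nothing in this sketch ties a sub-source to a bundle or to job startup, which
is the reason a neutral name such as "split" fits both cases.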

On Mon, Jan 9, 2017 at 2:12 PM Stas Levin <stasle...@gmail.com> wrote:

> Definitely seems like the formatting got lost in translation, sorry about
> that :)
>
> I guess both cases (methods) create splits, which are essentially a list of
> bounded/unbounded source instances, each responsible for reading certain
> segments (physical or otherwise) of the data.
>
> On Mon, Jan 9, 2017 at 11:51 PM Stephen Sisk <s...@google.com.invalid>
> wrote:
>
> > hi!
> >
> > I think your strikethrough got lost due to this being a text-only email
> > list. To make sure, I think you're asking the following:
> > "would it be reasonable to think of splitIntoBundles as generateSplits?"
> > (ie, you strikethrough'd Initial)
> >
> > They are very similar and I definitely also think of them as occupying the
> > same niche. I'll let someone else who was around for naming discuss whether
> > it was intentional or not. Conceptually, the way that bounded vs streaming
> > are handled means that they are doing slightly different things: a bounded
> > source is really kind of creating physical chunks of the data, whereas the
> > streaming source is creating conceptual divisions of the data that will be
> > used later. I'm not sure that's worth the confusion caused by the
> > differences.
> >
> > One thing to clarify - splitIntoBundles does have an "Initial" aspect to
> > it. I don't believe there is a publicly defined/written down order the
> > Sources & Reader methods are called in, but a runner trying to get
> > efficiency would be able to use splitIntoBundles during job startup to be
> > able to split up the work before creating readers rather than after
> > creating readers and waiting to use splitAtFraction.
> >
> > S
> >
> > On Sun, Jan 8, 2017 at 6:06 AM Stas Levin <stasle...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > A short terminology question regarding "bundle", and
> > > particularly splitIntoBundles vs. generateInitialSplits.
> > >
> > > In *BoundedSource* we have:
> > > List<? extends BoundedSource<T>> *splitIntoBundles*(...)
> > >
> > > In *UnboundedSource* we have:
> > > List<? extends UnboundedSource<OutputT, CheckpointMarkT>>
> > > *generateInitialSplits*(...)
> > >
> > > I was wondering if the names were intentionally made different, i.e. "into
> > > bundles" vs "into splits"?
> > > In a way these two methods carry out a very similar task, would it be
> > > reasonable to think of *splitIntoBundles* as *generate*Initial*Splits*?
> > > (strikethrough due to "initial" not being applicable in the case of bounded
> > > sources)
> > >
> > > Regards,
> > > Stas
> > >
> >
>
