Re: Preferred locations (or data locality) for batch pipelines.

Jesse Anderson Thu, 22 Sep 2016 07:46:30 -0700

I've only ever seen that being used to figure out which file the
runner/mapper/operation is working on. Otherwise, I haven't seen those
operations care where in the file they're working.


On Thu, Sep 22, 2016 at 5:57 AM Amit Sela <amitsel...@gmail.com> wrote:

> Wouldn't it force all runners to implement this for all distributed
> filesystems? It's true that each runner has its own "partitioning"
> mechanism, but I assume (maybe I'm wrong) that open-source runners use the
> Hadoop InputFormat/InputSplit for that, and the proper connectors for that
> to run on top of s3/gs.
>
> If this is wrong, each runner should take care of its own, but if not, we
> could have a generic solution for runners, no?
>
> Thanks,
> Amit
>
> On Thu, Sep 22, 2016 at 3:30 PM Jean-Baptiste Onofré <j...@nanthrax.net>
> wrote:
>
> > Hi Amit,
> >
> > as the purpose is to remove IOChannelFactory, then I would suggest it's
> > a runner concern. The Read.Bounded should "locate" the bundles on an
> > executor close to the read data (even if it's not always possible
> > depending on the source).
> >
> > My $0.01
> >
> > Regards
> > JB
> >
> > On 09/22/2016 02:26 PM, Amit Sela wrote:
> > > It's not new that batch pipelines can optimize on data locality; my
> > > question is regarding this responsibility in Beam.
> > > If runners should implement generic Read.Bounded support, should they
> > > also implement locating the input blocks? Or should it be a part
> > > of IOChannelFactory implementations? Or is there another way to go at
> > > it that I'm missing?
> > >
> > > Thanks,
> > > Amit.
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
