Yes, your understanding is correct.
From: Jay Sen <>
Sent: Thursday, August 15, 2019 11:05 AM
To: <>
Subject: Re: Read locality on gobblin jobs

Hi Kuai,

Looks like the input to the MapReduce job is the work-unit file, not the
actual data; the data is fetched by the Gobblin task in the mapper once the
mapper container is spun up. That makes sense given the data is not available
before the task executes, but when the data is local to the same Gobblin
cluster, I believe data locality for the actual data won't be guaranteed or
even attempted.
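To make the point concrete, here is a toy sketch (plain Java, not the Gobblin or Hadoop API; all names are made up for illustration). An MR-style scheduler prefers nodes listed in a split's location hints; when the job input is the work-unit file, the hints describe where that small file lives, not where the data it references lives:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Toy model of locality-aware scheduling (hypothetical names, not Gobblin code).
public class LocalityToy {
    // A split carries location hints for the bytes it covers.
    record Split(String name, List<String> locationHints) {}

    // Prefer a hinted node if it is live; otherwise fall back to any node.
    static String schedule(Split split, Set<String> liveNodes) {
        for (String hint : split.locationHints()) {
            if (liveNodes.contains(hint)) {
                return hint;
            }
        }
        return liveNodes.iterator().next();
    }

    public static void main(String[] args) {
        Set<String> nodes = new LinkedHashSet<>(List.of("nodeA", "nodeB", "nodeC"));
        // The work-unit file happens to be stored on nodeC, so that is the
        // only hint the scheduler sees -- even though the actual data blocks
        // the work unit points at sit on nodeA.
        Split workUnitSplit = new Split("workunit-0", List.of("nodeC"));
        System.out.println(schedule(workUnitSplit, nodes)); // nodeC, not nodeA
    }
}
```

So even a perfectly locality-aware scheduler lands the task next to the work-unit file, and the real data read still crosses the network.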

I haven't explored YARN mode, but I believe the YARN containers are spun up
before job submission, so there is no way to achieve data locality there
either.

Unless, for such a scenario, we make it a two-step process (as Hive does), or
create reducers to achieve data locality.

Am I right in my understanding here? Please comment.


Reply from Kuai Yui:

The Helix framework in cluster mode doesn't have a data locality concept. I
believe that only exists in YARN/MR mode.

On Sun, Aug 11, 2019 at 5:25 PM Jay Sen <> wrote:

> Hi Dev Team,
> When Gobblin runs in cluster mode or MR mode, and a job needs to read
> data from a Hadoop filesystem that is local, i.e. on the same Gobblin
> cluster, does Gobblin or Helix figure out the data locality automatically
> (as in a typical MR job)?
> I doubt this is the case, but I wanted to find out whether there is a way
> to achieve it anyway.
> Some more context:
> I am preparing the following Gobblin pipeline:
> *external hadoop cluster* ---job 1---> *gobblin hadoop cluster-1* ---job
> 2---> *gobblin hadoop cluster-2* ---job 3---> *target platform*
> (oracle/mysql)
> Here jobs 1, 2, and 3 are Gobblin jobs.
> I am also looking for suggestions on which Gobblin mode would be best for
> this scenario; currently I am looking at Gobblin cluster mode and MR mode.
> Thanks
> Jay
