Yes, your understanding is correct.

________________________________
From: Jay Sen <jay...@apache.org>
Sent: Thursday, August 15, 2019 11:05 AM
To: dev@gobblin.incubator.apache.org <dev@gobblin.incubator.apache.org>
Subject: Re: Read locality on gobblin jobs
Hi Kuai,

It looks like the input to the MapReduce job is the workunit file, not the actual data, which is fetched by the Gobblin task once the mapper container is spun up. That makes sense given that the data is not available before the task executes, but when the data is local, i.e. on the same Gobblin cluster, I believe data locality for the actual data won't be guaranteed or even attempted.

I haven't explored the YARN mode, but I believe the YARN containers are spun up before job submission, so there is no way to achieve data locality there either. Unless, for such a scenario, we make it a two-step process (as Hive does), or create reducers to achieve data locality.

Am I right in my understanding here? Please comment.

Thanks
Jay

Reply from Kuai Yui:

The Helix framework in cluster mode doesn't have a data locality concept. I think that exists only in YARN/MR mode.

On Sun, Aug 11, 2019 at 5:25 PM Jay Sen <jay...@apache.org> wrote:
> Hi Dev Team,
>
> When Gobblin runs in cluster mode or MR mode, and the job needs to read
> data from a Hadoop filesystem that is local, i.e. on the same Gobblin
> cluster, does Gobblin or Helix figure out the data locality automatically
> (as in a typical MR job)?
> I doubt this is the case, but I just wanted to get some info on whether
> there is a way to achieve it anyway.
>
> Some more context:
> I am preparing the following Gobblin pipeline:
>
> *external hadoop cluster* ---job 1---> *gobblin hadoop cluster-1* ---job
> 2---> *gobblin hadoop cluster-2* ---job 3---> *target platform*
> (oracle/mysql)
>
> Here jobs 1, 2, and 3 are Gobblin jobs.
>
> I am also looking for suggestions on which Gobblin mode would be best for
> this scenario.
> Currently I am looking at Gobblin cluster mode and MR mode.
>
> Thanks
> Jay
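[Editor's note] To make the point in the thread concrete: Hadoop schedules each map task near the hosts holding its input split's blocks, and in Gobblin's MR mode the split covers the serialized workunit file rather than the data the task will read. The toy sketch below (plain Python, not Gobblin or Hadoop APIs; all paths, host names, and the helper function are made up for illustration) shows why the resulting locality hint has no relation to where the actual data lives.

```python
# Illustrative sketch, NOT Gobblin code: why MR-mode locality hints miss
# the actual data. The scheduler only sees the block locations of the
# input split, which in Gobblin's MR mode is the tiny workunit file.

def split_locality_hint(block_locations, path):
    """Return the hosts an MR scheduler would prefer for a split of `path`."""
    return block_locations[path]

# Hypothetical block placements on a small cluster.
block_locations = {
    "/gobblin/workunits/job_123/wu_0.wu": ["node-7"],    # tiny workunit file
    "/data/events/part-0000.avro": ["node-1", "node-2"], # data the task reads
}

# The mapper's input is the workunit file, so the scheduler sees only this:
hint = split_locality_hint(block_locations, "/gobblin/workunits/job_123/wu_0.wu")
print(hint)  # ['node-7'] -- unrelated to where the Avro data actually lives
```

The data itself is opened inside the task at runtime, after the container has already been placed, which is exactly why a two-step approach would be needed to get real read locality.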