Hi Arun,

Ah yes, the first comment by Owen O'Malley is exactly what I have in mind.
Thanks,
Felix Halim

On Wed, Feb 10, 2010 at 3:04 AM, Arun C Murthy <a...@yahoo-inc.com> wrote:
> Felix, you might want to follow
> https://issues.apache.org/jira/browse/MAPREDUCE-1434.
> We are discussing ideas very similar to what you've just described over
> there.
>
> Arun
>
> On Feb 8, 2010, at 9:49 PM, Felix Halim wrote:
>
>> Hi,
>>
>> Currently the barrier between r(i) and m(i+1) is the Job barrier.
>> That is, m(i+1) will be blocked until all r(i) finish (until Job i
>> finishes).
>>
>> I'm saying this blocking is not necessary if we can concatenate them
>> all into a single Job as an endless chain.
>> Then m(i+1) can start immediately even when r(i) is not finished.
>>
>> The termination condition is when some counter, checked after r(i)
>> has finished, is zero.
>> At that point the result of m(i+1) is discarded.
>>
>> I don't know how to make it clearer than this...
>>
>> Felix Halim
>>
>> On Tue, Feb 9, 2010 at 1:41 PM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
>>>
>>> Hi,
>>>
>>>>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>>>
>>> My understanding is it would be something like:
>>> m1 | (r1 m2) | m(identity) | r2, if you combine r(i) and m(i+1),
>>> because of the hard distinction between Rs and Ms.
>>>
>>> Amogh
>>>
>>> On 2/4/10 1:46 PM, "Felix Halim" <felix.ha...@gmail.com> wrote:
>>>
>>> Talking about barriers: currently there is a barrier between every
>>> stage:
>>>
>>> m1 | r1 | m2 | r2 | ... | mK | rK
>>>
>>> where | is the barrier.
>>>
>>> I'm saying that the barrier between r(i) and m(i+1) is not necessary.
>>> So it should go like this:
>>>
>>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>>>
>>> Here the result of m(K+1) is thrown away.
>>> We take the result of rK only.
>>>
>>> Shuffling is needed only between m(i) and r(i).
>>> No shuffling is needed between r(i) and m(i+1).
>>>
>>> Thus, by removing the barrier between r(i) and m(i+1), the overall job
>>> can be made faster.
>>>
>>> Now the question is: can this be done using chaining?
>>> AFAIK, the chaining has to be defined before the job is started, right?
>>> But because I don't know the value of K beforehand,
>>> I want the chain to continue forever until some counter in the reduce
>>> task is zero.
>>>
>>> Felix Halim
>>>
>>> On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar <am...@yahoo-inc.com>
>>> wrote:
>>>>
>>>>>> However, from r(i) to m(i+1) there is an unnecessary barrier. m(i+1)
>>>>>> should not need to wait for all reducers r(i) to finish, right?
>>>>
>>>> Yes, but r(i+1) can't be in the same job, since that requires another
>>>> sort and shuffle phase (a barrier). So you would end up doing job(i):
>>>> m(i) r(i) m(i+1), and job(i+1): m(identity) r(i+1). Of course, this is
>>>> assuming you can't do r(i+1) in m(identity), for if you can, then it
>>>> doesn't need sort and shuffle, and hence your job would again be of
>>>> the form m+rm* :)
>>>>
>>>> Amogh
>>>>
>>>> On 2/4/10 10:19 AM, "Felix Halim" <felix.ha...@gmail.com> wrote:
>>>>
>>>> Hi Ed,
>>>>
>>>> Currently my program is like this: m1, r1, m2, r2, ..., mK, rK. The
>>>> barrier between m(i) and r(i) is acceptable, since the reducer has to
>>>> wait for all map tasks to finish. However, from r(i) to m(i+1) there
>>>> is an unnecessary barrier. m(i+1) should not need to wait for all
>>>> reducers r(i) to finish, right?
>>>>
>>>> Currently, I create one Job for each m(i), r(i), so I have a total of
>>>> K jobs. Is there a way to chain them all together into a single Job?
>>>> However, I don't know the value of K in advance; it has to be checked
>>>> after each r(i). So I'm thinking that the job could speculatively do
>>>> the chain over and over until it discovers that some counter in r(i)
>>>> is zero (so the result of m(K+1) is discarded, and the final result
>>>> of rK is taken).
>>>>
>>>> Felix Halim
>>>>
>>>> On Thu, Feb 4, 2010 at 12:25 PM, Ed Mazur <ma...@cs.umass.edu> wrote:
>>>>>
>>>>> Felix,
>>>>>
>>>>> You can use ChainMapper and ChainReducer to create jobs of the form
>>>>> M+RM*.
>>>>> Is that what you're looking for? I'm not aware of anything that
>>>>> allows you to have multiple reduce functions without the job
>>>>> "barrier".
>>>>>
>>>>> Ed
>>>>>
>>>>> On Wed, Feb 3, 2010 at 9:41 PM, Felix Halim <felix.ha...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> As far as I know, a barrier exists between the map and reduce
>>>>>> functions in one round of MR, and there is another barrier where the
>>>>>> reducers end the job for that round. However, if we run several
>>>>>> rounds using the same map and reduce functions, the barrier between
>>>>>> the reduce of one round and the map of the next round is NOT
>>>>>> necessary, right? Since the reducer outputs only a single value for
>>>>>> each key, it might as well run a map task for the next round
>>>>>> immediately rather than waiting for all the reducers to finish. This
>>>>>> way, the utilization of the machines between rounds can be improved.
>>>>>>
>>>>>> Is there a setting in Hadoop to do that?
>>>>>>
>>>>>> Felix Halim
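
[Editor's note] The driver-side control flow the thread converges on (run one MapReduce round per iteration, stopping when a reducer-side counter reaches zero, with K discovered at runtime) can be sketched as follows. This is a minimal simulation of the loop only, under stated assumptions: `Round` and its `run` method are hypothetical stand-ins for submitting a Hadoop job (e.g. via `JobClient.runJob(conf)`) and then reading a counter from the completed job; no actual Hadoop API is used here.

```java
// Sketch of an iterative MR driver loop with counter-based termination.
// The Hadoop-specific parts are elided: Round.run(i) is a hypothetical
// stand-in for configuring and submitting job i (m(i) + r(i)) and
// returning a counter that the reduce phase incremented, e.g. the
// number of records still needing another round.
public class IterativeDriver {

    // Hypothetical stand-in for one MapReduce round. In real driver code
    // this would submit a job and read a counter from the finished job.
    interface Round {
        long run(int iteration);
    }

    // Run rounds until the reduce-side counter is zero. K (the total
    // number of rounds) is not known up front; it is discovered here.
    static int runUntilConverged(Round round) {
        int i = 0;
        long remaining;
        do {
            remaining = round.run(i);  // submit round i, get its counter
            i++;
        } while (remaining != 0);      // counter zero => rK was the final reduce
        return i;                      // total rounds actually executed
    }

    public static void main(String[] args) {
        // Toy stand-in: the counter drops 3 -> 2 -> 1 -> 0 across rounds.
        int rounds = runUntilConverged(it -> 3 - it);
        System.out.println(rounds);    // prints 4 (counters were 3, 2, 1, 0)
    }
}
```

Note this still pays the job barrier between r(i) and m(i+1) that the thread wants to avoid; as Amogh points out, within one job only the m+rm* shape is possible, so each new reduce phase requires a new job.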