Hi Arun,

Ah yes, the first comment by Owen O'Malley is exactly what I have in mind.

Thanks,

Felix Halim

On Wed, Feb 10, 2010 at 3:04 AM, Arun C Murthy <a...@yahoo-inc.com> wrote:
> Felix, you might want to follow
> https://issues.apache.org/jira/browse/MAPREDUCE-1434.
> We are discussing ideas very similar to what you've just described over
> there.
>
> Arun
>
> On Feb 8, 2010, at 9:49 PM, Felix Halim wrote:
>
>> Hi,
>>
>> Currently the barrier between r(i) and m(i+1) is the Job barrier.
>> That is, m(i+1) will be blocked until all r(i) finish (until Job i
>> finishes).
>>
>> I'm saying this blocking is not necessary if we can concatenate them
>> all in a single Job as an endless chain.
>> Therefore m(i+1) can start immediately even when r(i) is not finished.
>>
>> The chain terminates when some counter is zero after r(i) has finished.
>> In that case the result of m(i+1) is discarded.
>>
>> I don't know how to make it clearer than this...
>>
>> Felix Halim
>>
>> On Tue, Feb 9, 2010 at 1:41 PM, Amogh Vasekar <am...@yahoo-inc.com> wrote:
>>>
>>> Hi,
>>>>>
>>>>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>>>
>>> My understanding is it would be something like:
>>> m1|(r1 m2)| m(identity) | r2, if you combine the r(i) and m(i+1),
>>> because of the hard distinction between Rs & Ms.
>>>
>>> Amogh
>>>
>>>
>>> On 2/4/10 1:46 PM, "Felix Halim" <felix.ha...@gmail.com> wrote:
>>>
>>> Talking about barriers, currently there is a barrier between every phase:
>>>
>>> m1 | r1 | m2 | r2 | ... | mK | rK
>>>
>>> where | is the barrier.
>>>
>>> I'm saying that the barrier between ri and m(i+1) is not necessary.
>>> So it should go like this:
>>>
>>> m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK m(K+1)
>>>
>>> Here the result of m(K+1) is thrown away.
>>> We take the result of rK only.
>>>
>>> The shuffling is needed only between mi and ri.
>>> There is no shuffling needed for ri and m(i+1).
>>>
>>> Thus by removing the barrier between ri and m(i+1), the overall job
>>> can be made faster.
>>>
>>> Now the question is, can this be done using Chaining?
>>> AFAIK, the chaining has to be defined before the job is started, right?
>>> But because I don't know the value of K beforehand,
>>> I want the chain to continue until some counter in the reduce task
>>> is zero.
>>>
>>> Felix Halim
>>>
>>>
>>> On Thu, Feb 4, 2010 at 3:53 PM, Amogh Vasekar <am...@yahoo-inc.com>
>>> wrote:
>>>>
>>>>>> However, from ri to m(i+1) there is an unnecessary barrier. m(i+1)
>>>>>> should not need to wait for all reducers ri to finish, right?
>>>>
>>>> Yes, but r(i+1) can't be in the same job, since that requires another
>>>> sort and shuffle phase (barrier). So you would end up doing job(i):
>>>> m(i) r(i) m(i+1), then job(i+1): m(identity) r(i+1). Of course, this
>>>> assumes you can't do r(i+1) in m(identity); if you can, then it doesn't
>>>> need sort and shuffle, and hence your job would again be of the form
>>>> m+rm* :)
>>>>
>>>> Amogh
>>>>
>>>> On 2/4/10 10:19 AM, "Felix Halim" <felix.ha...@gmail.com> wrote:
>>>>
>>>> Hi Ed,
>>>>
>>>> Currently my program is like this: m1, r1, m2, r2, ..., mK, rK. The
>>>> barrier between mi and ri is acceptable since the reducers have to wait
>>>> for all map tasks to finish. However, from ri to m(i+1) there is an
>>>> unnecessary barrier. m(i+1) should not need to wait for all reducers
>>>> ri to finish, right?
>>>>
>>>> Currently, I create one Job for each mi, ri, so I have a total of K
>>>> jobs. Is there a way to chain them all together into a single Job?
>>>> However, I don't know the value of K in advance. It has to be checked
>>>> after each ri. So I'm thinking that the job can speculatively continue
>>>> the chain over and over until it discovers that some counter in ri is
>>>> zero (so the result of m(K+1) is discarded, and the final result of rK
>>>> is taken).
>>>>
>>>> Felix Halim
>>>>
>>>>
>>>> On Thu, Feb 4, 2010 at 12:25 PM, Ed Mazur <ma...@cs.umass.edu> wrote:
>>>>>
>>>>> Felix,
>>>>>
>>>>> You can use ChainMapper and ChainReducer to create jobs of the form
>>>>> M+RM*. Is that what you're looking for? I'm not aware of anything that
>>>>> allows you to have multiple reduce functions without the job
>>>>> "barrier".
>>>>>
>>>>> Ed
>>>>>
>>>>> On Wed, Feb 3, 2010 at 9:41 PM, Felix Halim <felix.ha...@gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> As far as I know, a barrier exists between the map and reduce
>>>>>> functions in one round of MR. There is another barrier for the
>>>>>> reducers to end the job for that round. However, if we want to run
>>>>>> several rounds using the same map and reduce functions, then the
>>>>>> barrier between the reduce and the map of the next round is NOT
>>>>>> necessary, right? The reducer only outputs a single value for each
>>>>>> key, so it may as well run a map task for the next round immediately
>>>>>> rather than waiting for all reducers to finish. This way, the
>>>>>> utilization of the machines between rounds can be improved.
>>>>>>
>>>>>> Is there a setting in Hadoop to do that?
>>>>>>
>>>>>> Felix Halim
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>
>
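
The saving argued for in this thread can be illustrated with a toy timing
model. This is only a sketch and not Hadoop code: the class BarrierModel,
its method names, and the task durations are all hypothetical. It assumes a
fixed set of parallel slots, that slot j runs m(i+1) on the output of its own
r(i), that there is at least one round, and it ignores job startup and
shuffle costs.

```java
import java.util.Arrays;

/**
 * Toy timing model (hypothetical, not Hadoop code): total running time of
 * K map/reduce rounds with and without the barrier between r(i) and m(i+1).
 * times[i][j] is the duration of the task in round i on slot j.
 */
final class BarrierModel {

    /** Largest value in a non-empty array: a barrier waits for the slowest task. */
    static long max(long[] xs) {
        return Arrays.stream(xs).max().getAsLong();
    }

    /** m1 | r1 | m2 | r2 | ... | mK | rK: every phase ends in a barrier. */
    static long withJobBarrier(long[][] mapTimes, long[][] reduceTimes) {
        long total = 0;
        for (int i = 0; i < mapTimes.length; i++) {
            total += max(mapTimes[i]) + max(reduceTimes[i]);
        }
        return total;
    }

    /**
     * m1 | r1 m2 | r2 m3 | ... | r(K-1) mK | rK: slot j starts m(i+1) as
     * soon as its own r(i) is done; only the map-to-reduce barrier remains.
     */
    static long withoutJobBarrier(long[][] mapTimes, long[][] reduceTimes) {
        int rounds = mapTimes.length;
        long total = max(mapTimes[0]); // m1, followed by the shuffle barrier
        for (int i = 0; i + 1 < rounds; i++) {
            long slowest = 0;
            for (int j = 0; j < reduceTimes[i].length; j++) {
                // r(i) and m(i+1) run back to back on the same slot
                slowest = Math.max(slowest, reduceTimes[i][j] + mapTimes[i + 1][j]);
            }
            total += slowest;
        }
        // rK gives the final result; the discarded m(K+1) is not counted
        total += max(reduceTimes[rounds - 1]);
        return total;
    }

    public static void main(String[] args) {
        long[][] mapTimes = {{4, 4}, {2, 6}, {5, 1}};
        long[][] reduceTimes = {{7, 3}, {2, 8}, {4, 4}};
        System.out.println("with job barrier:    " + withJobBarrier(mapTimes, reduceTimes));
        System.out.println("without job barrier: " + withoutJobBarrier(mapTimes, reduceTimes));
    }
}
```

With the made-up durations in main, the model gives 34 time units with the
job barrier and 26 without it. In this model the second figure can never be
larger than the first, because the slowest combined r(i) m(i+1) slot takes at
most as long as the slowest r(i) plus the slowest m(i+1).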
