This makes sense until you realize:

                a) It won't scale.

                b) Machines fail.

On Dec 20, 2010, at 5:26 AM, Martin Becker wrote:

> I wrote a little bit much, so I put a summary up front. Sorry about that.
> 
> Summary:
> 1) Is there any point in time where a single Hadoop instance has
> access to all the keys that are to be distributed to the nodes along
> with the corresponding data? Or, failing that, could nodes have task
> priorities, killing and rescheduling tasks when higher-priority tasks
> arrive (keywords: Partitioner, TaskScheduler)?
> 
> 2) Blocking running tasks does not get me anywhere if they are not
> suspended to let other Reducers take their place, since a blocked
> task still occupies a Reducer slot, doesn't it? The main problem is
> Reducers waiting for a slot.
> 
> 3) Why Reducer ordering (could) affect(s) processing.
> 
> ad 1)
> 1.1) The only thing the JobTracker would need to do is look at the
> keys and derive some job-internal order of Reduce tasks. At this
> point the JobTracker (or _any other_ instance able to do such a
> thing!) would need to know how many Reducers are to start for a
> specific job, and what their keys are, or at least their priority.
> 
> 1.2) At some point the Partitioner is distributing keys to
> partitions, and thus to nodes. That means it could at least group
> high-priority with low-priority tasks (based on some criterion).
> Could not, _for example_, the TaskTrackers themselves then decide
> which of the tasks assigned to them by the Partitioner to execute
> first?
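A minimal sketch of the grouping idea from 1.2, written as plain Java rather than against Hadoop's actual `Partitioner` API (the class name and the priority-key set are illustrative assumptions): keys flagged as high priority are routed to partition 0, so whatever consumes partitions in numeric order would tend to encounter the high-priority work first. Note that Hadoop itself makes no guarantee about the execution order of reduce partitions.

```java
import java.util.Set;

// Illustrative stand-in for a custom Partitioner's getPartition() logic.
// High-priority keys map to partition 0; everything else is hashed across
// the remaining partitions (1 .. numPartitions-1).
class PriorityPartitioner {
    private final Set<String> highPriorityKeys;

    PriorityPartitioner(Set<String> highPriorityKeys) {
        this.highPriorityKeys = highPriorityKeys;
    }

    int getPartition(String key, int numPartitions) {
        if (numPartitions <= 1) {
            return 0;
        }
        if (highPriorityKeys.contains(key)) {
            return 0; // high-priority bucket
        }
        // Spread the remaining keys over partitions 1..numPartitions-1.
        int hash = key.hashCode() & Integer.MAX_VALUE; // force non-negative
        return 1 + (hash % (numPartitions - 1));
    }
}
```

Partition number only controls which reduce task gets the keys, not when that task runs; this sketch is the "grouping" half of the idea, and the scheduling half would still need something like the TaskScheduler discussed above.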
> 
> 1.3) The question is basically: is there some instance that CAN
> prioritize tasks the way I want it to, even if only in combination
> with the Partitioner: "Oh, a task with lower priority is running. So
> kill it, and restart it later." Maybe this would work using something
> other than the FairScheduler. I am making wild guesses here, but I
> think I am drifting towards the TaskScheduler, if it actually does
> what I think it does.
> 
> ad 2) Blocking:
> If I have enough slots for all the Reduce tasks, I have no problem at
> all. There is no sense in starting a task and then blocking it. Why
> not let it run? It is not as if the Reducers have to wait for some
> other to finish. They could just quit working, or not even start, if
> their output is redundant (see "ad 3)").
> 
> ad 3) This is why Reducer ordering affects processing:
> Preliminaries:
> * Each Reducer raises a (global! - using ZooKeeper, FileSystem or
> maybe Counters) threshold.
> * Each Reducer can estimate if it will ever pass a given threshold.
> * Output of Reducers that cannot pass the threshold is discarded.
> * Some Reducers have a higher probability (by Key) to raise the
> threshold faster.
> 
> As a result it would make sense to first run the Reducers that are
> more likely to raise the threshold. Reducers can cease their work, or
> not even start, once they can no longer pass the threshold.
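The shared-threshold scheme from "ad 3)" can be sketched in plain Java. The global threshold (which the proposal would keep in ZooKeeper, the FileSystem, or Counters) is simulated here by a single `AtomicLong`; the class and method names are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Simulation of the shared-threshold scheme: each Reducer raises a
// global threshold, and any Reducer that can bound its best possible
// score below the current threshold knows its output is redundant.
class ThresholdGate {
    private final AtomicLong globalThreshold = new AtomicLong(Long.MIN_VALUE);

    // A Reducer raises the threshold with the best score it has produced;
    // accumulateAndGet with max keeps the update monotonic under concurrency.
    void raiseTo(long score) {
        globalThreshold.accumulateAndGet(score, Math::max);
    }

    // Checked before (or periodically during) a Reducer's work: if false,
    // the Reducer can quit, or never start at all.
    boolean canStillPass(long upperBoundOfMyBestScore) {
        return upperBoundOfMyBestScore > globalThreshold.get();
    }

    long current() {
        return globalThreshold.get();
    }
}
```

In a real cluster the `AtomicLong` would be replaced by a shared store visible to all tasks; the monotonic-max update is the property that store must preserve.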
> 
> 
> On Mon, Dec 20, 2010 at 11:58 AM, Harsh J <qwertyman...@gmail.com> wrote:
>> The JobTracker wouldn't know what your data is going to be when it
>> is assigning the Reduce Tasks.
>> 
>> If you really do need ordering among your reducers, you should
>> implement a locking mechanism (making sure the dormant reduce tasks
>> stay alive by sending out some status reports).
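The locking mechanism suggested above can be sketched as follows, assuming a shared "turn" token (which ZooKeeper would provide in a real cluster, simulated here with an `AtomicInteger`). The progress callback is modelled as a `Runnable`, standing in for the periodic status reports (e.g. `context.progress()`) that keep a dormant reduce task from being declared dead; all names here are illustrative.

```java
import java.util.concurrent.atomic.AtomicInteger;

// A dormant reduce task spins on a shared turn token and pings a
// progress callback so the framework doesn't kill it while it waits.
class TurnLock {
    private final AtomicInteger currentTurn = new AtomicInteger(0);

    void awaitTurn(int myTurn, Runnable reportProgress)
            throws InterruptedException {
        while (currentTurn.get() != myTurn) {
            reportProgress.run(); // stand-in for context.progress()
            Thread.sleep(10);     // back off between checks
        }
    }

    // Called by the task holding the current turn once it finishes,
    // handing the token to the next task in the specified order.
    void release() {
        currentTurn.incrementAndGet();
    }
}
```

This enforces the start order, but it also illustrates the objection raised in this thread: every waiting task still occupies a reducer slot for the whole time it spins.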
>> 
>> Although, how is ordering going to affect your reducer's processing? :)
>> 
>> On Mon, Dec 20, 2010 at 2:37 PM, Martin Becker <_martinbec...@web.de> wrote:
>>> I just reread my first post. Maybe I was not clear enough:
>>> It is only important to me that the Reduce tasks _start_ in a
>>> specified order based on their key. That is the only additional
>>> constraint I need.
>>> 
>>> On Mon, Dec 20, 2010 at 9:51 AM, Martin Becker <_martinbec...@web.de> wrote:
>>>> As far as I understand, MapReduce waits for all Mappers to finish
>>>> before it starts running Reduce tasks. Am I mistaken here? If I am
>>>> not, then I do not see any more synchrony being introduced than there
>>>> already is (no locks required). Of course I am not aware of all the
>>>> internals, but MapReduce works with a single JobTracker, which
>>>> distributes Reduce tasks to the different nodes (see
>>>> http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Overview).
>>>> So the only point where my "theory" would break is if Reducers start
>>>> before Mappers finish. Otherwise the JobTracker should be able to
>>>> schedule Reduce tasks in a specific order.
>>>> 
>>>> On Mon, Dec 20, 2010 at 4:45 AM, Harsh J <qwertyman...@gmail.com> wrote:
>>>>> You could use sort of a distributed lock service to achieve this
>>>>> (ZooKeeper can help). But such things ought to be avoided as David
>>>>> pointed out above.
>>>>> 
>>>>> On Sun, Dec 19, 2010 at 9:09 PM, Martin Becker <_martinbec...@web.de> 
>>>>> wrote:
>>>>>> Hello everybody,
>>>>>> 
>>>>>> is there a possibility to make sure that certain/all reduce tasks,
>>>>>> i.e. the reducers for certain keys, are executed in a specified order?
>>>>>> This is Job internal, so the Job Scheduler is probably the wrong place 
>>>>>> to start?
>>>>>> Does the order induced by the Comparable interface influence the
>>>>>> execution order at all?
>>>>>> 
>>>>>> Thanks in advance,
>>>>>> Martin
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Harsh J
>>>>> www.harshj.com
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Harsh J
>> www.harshj.com
>> 
