This makes sense until you realize:
a) It won't scale.
b) Machines fail.

On Dec 20, 2010, at 5:26 AM, Martin Becker wrote:

> I wrote quite a lot, so I put a summary up front. Sorry about that.
>
> Summary:
> 1) Is there any point in time where one single instance of Hadoop has
> access to all keys that are to be distributed to the nodes, together
> with the corresponding data? Or maybe nodes could at least have Task
> priorities, killing and rescheduling tasks if higher-priority tasks
> arrive (keywords: Partitioner, TaskScheduler).
>
> 2) Blocking running tasks does not get me anywhere if they are not
> suspended in a way that lets other Reducers take their place, since
> blocked tasks still take up Reducer slots, don't they? The main
> problem is Reducers waiting for a slot.
>
> 3) Why Reducer ordering (could) affect(s) processing.
>
> ad 1)
> 1.1) The only thing the JobTracker would need to do is to look at keys
> and derive some job-internal order of Reduce tasks. At this point it
> would be necessary for the JobTracker (or _any other_ instance which
> would be able to do such a thing!) to know how many Reducers are
> to start for a specific job, and what their keys are or at least what
> their priority is.
>
> 1.2) At some point the Partitioner is distributing keys to nodes,
> meaning it could at least group high-quality with low-quality tasks
> (based on some criterion). Now, could the TaskTrackers themselves,
> _for example_, not decide which of the tasks assigned to them by the
> Partitioner to execute first?
>
> 1.3) So the question is basically: is there some instance that CAN
> prioritize Tasks the way I want it to? Even if it is only
> in combination with the Partitioner: "Oh, a Task with lower priority
> is running. So kill it, and restart it later." Maybe this would work
> using something other than the FairScheduler. I am making wild guesses
> here, but I think I am drifting towards the TaskScheduler, if it
> actually does what I think it does.
>
> ad 2) Blocking:
> If I have enough slots for all the Reduce tasks, I have no problem at
> all. There is no sense at all in starting a task and then blocking it.
> Why not let it run? It is not as if the Reducers have to wait for some
> other Reducer to finish. They could just quit working, or not even
> start, if their output is redundant (see "ad 3)").
>
> ad 3) This is why Reducer ordering affects processing:
> Preliminaries:
> * Each Reducer raises a (global! - using ZooKeeper, FileSystem or
>   maybe Counters) threshold.
> * Each Reducer can estimate if it will ever pass a given threshold.
> * Output of Reducers that cannot pass the threshold is discarded.
> * Some Reducers have a higher probability (by Key) to raise the
>   threshold faster.
>
> As a result it would make sense to first run the Reducers with a
> higher probability of raising the threshold. Reducers can cease their
> work, or not even start, if they cannot pass the threshold anymore.
>
>
> On Mon, Dec 20, 2010 at 11:58 AM, Harsh J <qwertyman...@gmail.com> wrote:
>> The JobTracker wouldn't know what your data is going to be when it
>> is assigning the Reduce Tasks.
>>
>> If you really do need ordering among your reducers, you should
>> implement a locking mechanism (making sure the dormant reduce tasks
>> stay alive by sending out some status reports).
>>
>> Although, how is ordering going to affect your reducer's processing? :)
>>
>> On Mon, Dec 20, 2010 at 2:37 PM, Martin Becker <_martinbec...@web.de> wrote:
>>> I just reread my first post. Maybe I was not clear enough:
>>> It is only important to me that the Reduce tasks _start_ in a
>>> specified order based on their key. That is the only additional
>>> constraint I need.
>>>
>>> On Mon, Dec 20, 2010 at 9:51 AM, Martin Becker <_martinbec...@web.de> wrote:
>>>> As far as I understood, MapReduce waits for all Mappers to finish
>>>> before it starts running Reduce tasks. Am I mistaken here? If I am
>>>> not, then I do not see any more synchrony being introduced than
>>>> there already is (no locks required). Of course I am not aware of
>>>> all the internals, but MapReduce is working with a single
>>>> JobTracker, which distributes Reduce tasks to the different nodes (see
>>>> http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html#Overview).
>>>> So the only point where my "theory" would break is if Reducers start
>>>> before Mappers finish. Otherwise the JobTracker should be able to
>>>> schedule Reduce tasks in a specific order.
>>>>
>>>> On Mon, Dec 20, 2010 at 4:45 AM, Harsh J <qwertyman...@gmail.com> wrote:
>>>>> You could use a sort of distributed lock service to achieve this
>>>>> (ZooKeeper can help). But such things ought to be avoided, as David
>>>>> pointed out above.
>>>>>
>>>>> On Sun, Dec 19, 2010 at 9:09 PM, Martin Becker <_martinbec...@web.de>
>>>>> wrote:
>>>>>> Hello everybody,
>>>>>>
>>>>>> is there a possibility to make sure that certain/all reduce tasks,
>>>>>> i.e. the reducers for certain keys, are executed in a specified order?
>>>>>> This is Job-internal, so the Job Scheduler is probably the wrong place
>>>>>> to start?
>>>>>> Does the order induced by the Comparable interface influence the
>>>>>> execution order at all?
>>>>>>
>>>>>> Thanks in advance,
>>>>>> Martin
>>>>>>
>>>>>
>>>>> --
>>>>> Harsh J
>>>>> www.harshj.com
>>>>>
>>
>> --
>> Harsh J
>> www.harshj.com
>>
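
For readers following the "ad 3)" argument quoted above, here is a rough
sketch of why start order matters when Reducers share a global threshold.
It is a toy, single-slot simulation in plain Java, not Hadoop code: the
class ThresholdPruningSketch, the ReduceTask type, and its upperBound and
actualScore fields are all hypothetical names made up for illustration,
and an AtomicLong stands in for whatever would hold the global threshold
(ZooKeeper, the FileSystem, or Counters in the original post).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

/** Toy, single-slot simulation of threshold-based reducer pruning (no Hadoop APIs). */
class ThresholdPruningSketch {

    /** Hypothetical stand-in for one reduce task. */
    static final class ReduceTask {
        final String key;
        final long upperBound;   // assumed estimate: best score this task could reach
        final long actualScore;  // score the task really produces if it runs

        ReduceTask(String key, long upperBound, long actualScore) {
            this.key = key;
            this.upperBound = upperBound;
            this.actualScore = actualScore;
        }
    }

    /** Runs tasks one at a time in the given order; returns how many actually executed. */
    static int countExecuted(List<ReduceTask> order) {
        AtomicLong threshold = new AtomicLong(0); // stand-in for the global threshold
        int executed = 0;
        for (ReduceTask task : order) {
            if (task.upperBound <= threshold.get()) {
                continue; // task can no longer pass the threshold, so it never starts
            }
            executed++;
            // A finished task raises the threshold to its score if that score is higher.
            threshold.accumulateAndGet(task.actualScore, Math::max);
        }
        return executed;
    }

    /** Made-up sample data: (key, upperBound, actualScore). */
    static List<ReduceTask> sampleTasks() {
        List<ReduceTask> tasks = new ArrayList<>();
        tasks.add(new ReduceTask("a", 10, 9));
        tasks.add(new ReduceTask("b", 8, 5));
        tasks.add(new ReduceTask("c", 6, 6));
        tasks.add(new ReduceTask("d", 3, 2));
        return tasks;
    }

    /** Returns a sorted copy, ordered by the tasks' estimated upper bound. */
    static List<ReduceTask> sortedByUpperBound(List<ReduceTask> tasks, boolean descending) {
        List<ReduceTask> copy = new ArrayList<>(tasks);
        Comparator<ReduceTask> byBound = Comparator.comparingLong(t -> t.upperBound);
        copy.sort(descending ? byBound.reversed() : byBound);
        return copy;
    }

    public static void main(String[] args) {
        List<ReduceTask> tasks = sampleTasks();
        System.out.println("most promising first:  "
                + countExecuted(sortedByUpperBound(tasks, true)) + " task(s) executed");
        System.out.println("least promising first: "
                + countExecuted(sortedByUpperBound(tasks, false)) + " task(s) executed");
    }
}
```

With these four sample tasks, starting the most promising task first means
only 1 task executes and the other three are skipped, while the reverse
order executes all 4. The sketch says nothing about how to obtain such an
order from the JobTracker; it only shows the effect being asked about.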