On 15/09/11 02:01, Bharath Ravi wrote:
Thanks a lot, all!

An end goal of mine was to make Hadoop as flexible as possible.
Along the same lines, though unrelated to the above idea, was another one I
came across, courtesy of
http://hadoopblog.blogspot.com/2010/11/hadoop-research-topics.html

The blog mentions the ability to dynamically append input.
Specifically, can I append input to the Map and Reduce tasks after they've
been started?

Dhruba is referring to something they've actually implemented in their version of Hive: the ability to gradually increase the data input to a running Hive job.

This lets them do a query like "find 8 friends in california" without searching the entire dataset: pick a subset, search it, and if there are enough results, stop; if not, feed in some more data.
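
To make the control flow concrete, here is a toy, non-Hadoop illustration of the early-exit scan; the data, the "california" filter and the threshold are all made up:

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.List;

  public class EarlyExitScan {
    // Toy stand-in for a dataset split into blocks of records.
    static final List<List<String>> BLOCKS = Arrays.asList(
        Arrays.asList("alice,california", "bob,oregon"),
        Arrays.asList("carol,california", "dave,california"),
        Arrays.asList("erin,nevada", "frank,california"));

    public static void main(String[] args) {
      int wanted = 2;                      // stop once we have this many results
      List<String> found = new ArrayList<String>();
      for (List<String> block : BLOCKS) {  // feed in one block at a time
        for (String record : block) {
          if (record.endsWith("california")) {
            found.add(record);
          }
        }
        if (found.size() >= wanted) {      // enough results: stop scanning
          break;
        }
      }
      // The threshold is met after the second block, so the third
      // block is never scanned at all.
      System.out.println(found);
    }
  }

The point is that the tail of the data never gets read when the matches turn up early.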

I have a paper on this showing that for data with little or no skew, it is much faster than a full scan; for skewed data, where all the results sit in a small subset of blocks, it is about the same as a full scan - it depends on where in the scan the matching blocks are found.

I haven't been able to find something like this at a cursory glance, but
could someone advise me on this before I dig deeper?

1. Does such functionality exist, or is it being attempted?

It exists for Hive, though not in trunk; getting it in there would mostly be a matter of taking the existing code and slotting it in.

2. I would assume most cases would simply require starting a second Job for
the new input.

No, because that loses all the existing work and requires scheduling more of it. The goal here is to execute one job that can bail out early.

The Facebook code runs with Hive; for classic MR jobs the first step would be to allow map tasks to finish early. I think there may be a way to do that, and I plan to do some experiments to see if I'm right.
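
To be clear about what I mean, here's a rough sketch, not anything that exists in Hadoop today: with the new API a mapper can override Mapper.run() and break out of the record loop once some external "stop" flag appears. The marker path, the "california" filter and the counter names below are all invented for illustration:

  import java.io.IOException;

  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class EarlyFinishMapper
      extends Mapper<LongWritable, Text, Text, LongWritable> {

    // Hypothetical flag: once this file exists, map tasks wind down.
    static final Path STOP_MARKER = new Path("/tmp/job-stop-marker");
    private static final LongWritable ONE = new LongWritable(1);

    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
      fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (value.toString().contains("california")) {
        context.write(value, ONE);
        // Make progress visible to the client as a user counter.
        context.getCounter("search", "MATCHES").increment(1);
      }
    }

    @Override
    public void run(Context context)
        throws IOException, InterruptedException {
      setup(context);
      try {
        long seen = 0;
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
          // Poll the flag occasionally rather than on every record.
          if (++seen % 10000 == 0 && fs.exists(STOP_MARKER)) {
            break;  // finish this task early; its output still gets committed
          }
        }
      } finally {
        cleanup(context);
      }
    }
  }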

What would be more dramatic would be for the JT to be aware that jobs may finish early, and to slowly ramp up the map operations as long as the job hasn't set some "finished" flag (presumably a shared counter), so that the entire dataset still gets processed if the early finish doesn't happen. The scheduler could take this slow-start into account, knowing that the job's initial resource needs are quite low but may increase.
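
As a stopgap, the client side of that "shared counter" could live in the driver today: poll the running job's user counters and raise the stop flag once the threshold is hit, so the job winds down and commits its partial results instead of being killed. Again only a sketch, reusing the hypothetical EarlyFinishMapper, marker path and counter names from above:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Counters;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class EarlyExitDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "early-exit-scan");
      job.setJarByClass(EarlyExitDriver.class);
      job.setMapperClass(EarlyFinishMapper.class);
      job.setNumReduceTasks(0);                 // map-only, for simplicity
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(LongWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      FileSystem fs = FileSystem.get(conf);
      long enough = 8;                          // "find 8 friends"
      job.submit();                             // non-blocking
      while (!job.isComplete()) {
        Counters counters = job.getCounters();  // snapshot of live counters
        if (counters != null
            && counters.findCounter("search", "MATCHES").getValue() >= enough
            && !fs.exists(EarlyFinishMapper.STOP_MARKER)) {
          // Raise the flag; running map tasks notice it and wind down,
          // so the job completes and commits what it has found so far.
          fs.create(EarlyFinishMapper.STOP_MARKER).close();
        }
        Thread.sleep(5000);
      }
      System.exit(job.isSuccessful() ? 0 : 1);
    }
  }

A real implementation would put this logic in the JT and the scheduler rather than a polling client, but it shows the moving parts.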

However, are there practical use cases for such a feature?

See above

3. Are there any other ideas on such "flexibility" of the system that I
could contribute to?

While it's great that you want to do big things in Hadoop, I'd recommend you start using it and learning your way around the codebase - especially SVN trunk and the unreleased 0.23 branch, as that is where all major changes will go, and where the MR engine has been radically reworked for better scheduling.

Start writing MR jobs that work under the new engine, using existing public datasets, or look at the layers above it; then think about how things could be improved.
