+1 to Jeff's suggestions, especially on locality. I'd love to see rigorous work on having the scheduler prefer assigning tasks to the nodes that already host the relevant data. Generalizing this further, so that a full vertical integration of HDFS, HBase, and Map/Reduce could exploit maximal data locality, would be even cooler.

Chad
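As a rough sketch of the preference order such a locality-aware assignment might follow (node-local, then rack-local, then remote) — this is not the actual JobTracker code; Task, splitHosts, and the host-to-rack map are hypothetical stand-ins:

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class Task {
        final Set<String> splitHosts; // DataNodes that hold this task's input split
        Task(Set<String> splitHosts) { this.splitHosts = splitHosts; }
    }

    public class LocalitySketch {
        /** Pick a task for a requesting tracker, preferring local data. */
        static Task assign(String tracker, Map<String, String> hostToRack,
                           List<Task> pending) {
            // 1. Node-local: the requesting tracker already stores the split.
            for (Task t : pending)
                if (t.splitHosts.contains(tracker))
                    return t;
            // 2. Rack-local: some replica of the split lives in the tracker's rack.
            String rack = hostToRack.get(tracker);
            if (rack != null)
                for (Task t : pending)
                    for (String h : t.splitHosts)
                        if (rack.equals(hostToRack.get(h)))
                            return t;
            // 3. Fall back to any remaining task; its data is read over the network.
            return pending.isEmpty() ? null : pending.get(0);
        }
    }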
On 2/24/08 2:56 PM, "Jeff Hammerbacher" <[EMAIL PROTECTED]> wrote:

Hey Jaideep,

One interesting direction for research would be more sophisticated
scheduling policies for the JobTracker to help improve locality and
overall cluster utilization. The introduction of speculative execution
is a step in this direction; you could perhaps investigate the
implications of different speculative execution policies on different
job types.

Regards,
Jeff

On Sun, Feb 24, 2008 at 9:41 AM, Jaideep Dhok <[EMAIL PROTECTED]> wrote:
> Hello,
> I am a graduate research student in CS at the Search and Information
> Extraction Lab at IIIT Hyderabad, India (http://search.iiit.ac.in). I have
> been working on Nutch and Hadoop for the past couple of months, basically
> to get an understanding of the platform and to discover possible research
> areas for my thesis work. Most of the time I have been playing with the
> Hadoop code base, and by now I am fairly familiar with the internals
> (especially the Map-Reduce part).
>
> I have been reading publications related to Map-Reduce, the Google File
> System, etc., and I am still looking for interesting research topics. I was
> wondering if anyone would like to share or suggest any ideas related to the
> Hadoop platform. Any suggestions and comments are greatly appreciated.
>
> Thanks and Regards,
> Jaideep Dhok
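One toy formulation of the kind of speculative execution policy Jeff mentions: launch a backup copy of a task when its progress trails the mean progress of its peers by some margin. The Attempt type and the lag parameter below are illustrative assumptions, not Hadoop's actual heuristic:

    import java.util.List;

    public class SpeculationSketch {

        /** A running attempt's fraction complete, in [0, 1]. */
        record Attempt(String taskId, double progress, boolean hasBackup) {}

        /** Speculate when a task trails the mean progress of its peers by
         *  more than lag and no backup attempt is already running. */
        static boolean shouldSpeculate(Attempt a, List<Attempt> all, double lag) {
            double mean = all.stream()
                             .mapToDouble(Attempt::progress)
                             .average()
                             .orElse(0.0);
            return !a.hasBackup() && (mean - a.progress()) > lag;
        }
    }

Varying the lag threshold, or replacing the mean with an estimated time-to-completion, would give the different policy variants whose effect on different job types could be measured.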
