Hi Danushka,

On Mon, Feb 25, 2013 at 4:48 PM, Danushka Menikkumbura <[email protected]> wrote:
> Hi Devs,
>
> I am looking into extending the Big Data capabilities of Airavata as my
> M.Sc. research work. I have identified certain possibilities and am going
> to start by integrating Apache Hadoop (and Hadoop-like frameworks) with
> Airavata.
>
> From what I understand, the best approach would be to have a new
> GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
> have a new parameter in the ApplicationContext (say TargetApplication) to
> define the target application type and resolve the correct provider in the
> GFac Scheduler based on that. I see that having this capability in the
> Scheduler class is already a TODO. I have made these changes locally and
> invoked a simple Hadoop job using GFac, so I can assure that this approach
> is viable, barring any implication I am missing.
>
> I think we can store Hadoop job definitions in the Airavata Registry,
> where each definition would essentially include a unique identifier and
> other attributes like the mapper, reducer, sorter, formatters, etc., that
> can be defined using XBaya. Information about these building blocks could
> be loaded from XML metadata files (of a known format) included in the jar
> files. It should also be possible to compose Hadoop job "chains" using
> XBaya. So, what we specify in the application context would be the target
> application type (say Hadoop), the job/chain id, the input file location,
> and the output file location. In addition, I am thinking of adding job
> monitoring support based on constructs provided by the Hadoop API (which I
> have already looked into) and data querying based on Apache Hive/Pig.

I think we have pretty much this functionality done, in the same way you are
explaining. I have added the code to trunk; I will provide some test classes
and update the Scheduler to return the HadoopProvider (a rough sketch of the
dispatch I have in mind is included below).

> Furthermore, apart from Hadoop there are two other similar frameworks that
> look quite promising.
>
> 1. Sector/Sphere
>
> Sector/Sphere [1] is an open-source software framework for
> high-performance distributed data storage and processing, comparable to
> Apache HDFS/Hadoop. Sector is a distributed file system, and Sphere is the
> programming framework that supports massive in-storage parallel processing
> of data stored in Sector. The key motivation is that Sector/Sphere is
> claimed to be about 2-4 times faster than Hadoop.
>
> 2. Hyracks
>
> Hyracks [2] is another framework for data-intensive computing that is
> roughly in the same space as Apache Hadoop. It supports composing and
> executing native Hyracks jobs as well as running Hadoop jobs on the
> Hyracks runtime. Furthermore, it powers the popular parallel DBMS
> ASTERIX [3].

I am +1 for exposing these through the same API so that other components can
be plugged in, but do you think actual users would have a concern about the
library we use underneath for MapReduce jobs? I am not quite confident about
how people are using these, but anyhow it would be nice to have support for
them.
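To make the Scheduler change concrete, here is roughly the dispatch I have in
mind. This is only a sketch: GFacProvider, ApplicationContext, HadoopProvider
and the getTargetApplication() accessor below are simplified stand-ins for
the proposed TargetApplication parameter, not the actual classes in trunk,
and the real code may look different.

// Sketch only: simplified stand-ins for the real GFac classes.
interface GFacProvider {
    void execute(ApplicationContext context) throws Exception;
}

// The proposed TargetApplication parameter plus the Hadoop-specific fields.
class ApplicationContext {
    private String targetApplication; // e.g. "HADOOP"
    private String jobOrChainId;      // registry id of the job/chain definition
    private String inputLocation;
    private String outputLocation;

    String getTargetApplication() { return targetApplication; }
    // getters/setters for the remaining attributes omitted
}

class HadoopProvider implements GFacProvider {
    @Override
    public void execute(ApplicationContext context) throws Exception {
        // Load the job/chain definition from the registry, configure the
        // Hadoop job, submit it, and monitor it (see the sketch further down).
    }
}

class Scheduler {
    // Resolve the provider from the target application type instead of
    // always returning the default one (the TODO mentioned above).
    static GFacProvider schedule(ApplicationContext context) {
        if ("HADOOP".equalsIgnoreCase(context.getTargetApplication())) {
            return new HadoopProvider();
        }
        return defaultProvider(context);
    }

    static GFacProvider defaultProvider(ApplicationContext context) {
        // existing provider resolution goes here
        throw new UnsupportedOperationException("not part of this sketch");
    }
}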
Regards,
Lahiru

> I am yet to look into the APIs of these two frameworks, but they should
> ideally work with the same GFac implementation that I have proposed for
> Hadoop.
>
> I would strongly appreciate your feedback on this approach, and also on
> the pros and cons of using Sector/Sphere or Hyracks if you already have
> experience with them.
>
> [1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
> benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
>
> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
> flexible and extensible foundation for data-intensive computing,” in
> Proceedings of the 27th IEEE International Conference on Data Engineering
> (ICDE), 2011, pp. 1151–1162.
>
> [3] http://asterix.ics.uci.edu/
>
> Thanks,
> Danushka
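On the monitoring point: the org.apache.hadoop.mapreduce.Job client API
already exposes enough to poll a submitted job, so the provider could surface
progress to Airavata roughly as below. This is a minimal sketch assuming the
new mapreduce Job API; the HadoopJobMonitorSketch class and the idea of
pushing the progress values into Airavata's notification layer are just
placeholders for discussion.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HadoopJobMonitorSketch {

    // Submit the job without blocking, then poll its progress until it ends.
    public static boolean runAndMonitor(Configuration conf, String jobName,
                                        String input, String output)
            throws Exception {
        Job job = new Job(conf, jobName);
        // Mapper/reducer/format classes would come from the registry definition.
        FileInputFormat.addInputPath(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        job.submit(); // unlike waitForCompletion(), this returns immediately

        while (!job.isComplete()) {
            // Progress values are fractions in [0, 1]; instead of printing,
            // these could be pushed to Airavata's monitoring layer.
            System.out.printf("map %.0f%%, reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        return job.isSuccessful();
    }
}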
--
System Analyst Programmer
PTI Lab
Indiana University