On Mon, Feb 25, 2013 at 9:59 PM, Danushka Menikkumbura <[email protected]> wrote:

> Also, I suggest we have a simple plug-in architecture for providers that
> would make having custom providers possible.

Hi Danushka,

I guess the plugin mechanism for providers is already in place with the new
GFac architecture. Lahiru will be able to give more information about this.

Thanks,
Amila

>
> Thanks,
> Danushka
>
>
> On Tue, Feb 26, 2013 at 3:18 AM, Danushka Menikkumbura <[email protected]> wrote:
>
>> Hi Devs,
>>
>> I am looking into extending the Big Data capabilities of Airavata as my
>> M.Sc. research work. I have identified certain possibilities and am going
>> to start with integrating Apache Hadoop (and Hadoop-like frameworks) with
>> Airavata.
>>
>> According to what I have understood, the best approach would be to have a
>> new GFacProvider for Hadoop that takes care of handling Hadoop jobs. We
>> can have a new parameter in the ApplicationContext (say TargetApplication)
>> to define the target application type and resolve the correct provider in
>> the GFac Scheduler based on that. I see that having this capability in the
>> Scheduler class is already a TODO. I have been able to make these changes
>> locally and invoke a simple Hadoop job using GFac, so I am confident that
>> this approach is viable, barring any implications I have missed.
>>
>> I think we can store Hadoop job definitions in the Airavata Registry,
>> where each definition would essentially include a unique identifier and
>> other attributes like mapper, reducer, sorter, formatters, etc. that can
>> be defined using XBaya. Information about these building blocks could be
>> loaded from XML metadata files (of a known format) included in jar files.
>> It should also be possible to compose Hadoop job "chains" using XBaya. So,
>> what we specify in the application context would be the target application
>> type (say Hadoop), job/chain id, input file location and the output file
>> location. In addition, I am thinking of having job monitoring support
>> based on constructs provided by the Hadoop API (that I have already looked
>> into) and data querying based on Apache Hive/Pig.
>>
>> Furthermore, apart from Hadoop there are two other similar frameworks
>> that look quite promising.
>>
>> 1. Sector/Sphere
>>
>> Sector/Sphere [1] is an open source software framework for
>> high-performance distributed data storage and processing. It is comparable
>> with Apache HDFS/Hadoop. Sector is a distributed file system and Sphere is
>> the programming framework that supports massive in-storage parallel data
>> processing on data stored in Sector. The key motivation is that
>> Sector/Sphere is claimed to be about 2 to 4 times faster than Hadoop.
>>
>> 2. Hyracks
>>
>> Hyracks [2] is another framework for data-intensive computing that is
>> roughly in the same space as Apache Hadoop. It has support for composing
>> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
>> runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].
>>
>> I am yet to look into the APIs of these two frameworks, but they should
>> ideally work with the same GFac implementation that I have proposed for
>> Hadoop.
>>
>> I would strongly appreciate your feedback on this approach. I would also
>> welcome any pros and cons of using Sector/Sphere or Hyracks if you already
>> have experience with them.
>>
>> [1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
>> benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
>> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
>>
>> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
>> flexible and extensible foundation for data-intensive computing,” in Data
>> Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011, pp.
>> 1151–1162.
>>
>> [3] http://asterix.ics.uci.edu/
>>
>> Thanks,
>> Danushka
>>
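
To make the provider-resolution idea in the quoted proposal a little more
concrete, here is a minimal Java sketch. Only the idea of choosing a provider
from a target application type (plus job id, input location and output
location) comes from the proposal. All type and method names below (Provider,
HadoopProvider, ProviderRegistry), as well as the job id and paths, are
hypothetical stand-ins and do not reflect the actual GFac interfaces.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types are Airavata's actual GFac API.
class ProviderResolutionSketch {

    /** Stand-in for a GFac provider; the real interface may look quite different. */
    interface Provider {
        String execute(String jobId, String inputLocation, String outputLocation);
    }

    /** Hypothetical provider that would run a stored Hadoop job definition. */
    static class HadoopProvider implements Provider {
        @Override
        public String execute(String jobId, String inputLocation, String outputLocation) {
            // A real implementation would build and submit the job through the Hadoop API.
            // Here we only return a description of what would be submitted.
            return "hadoop:" + jobId + ":" + inputLocation + "->" + outputLocation;
        }
    }

    /** Maps a target application type (e.g. "Hadoop") to the provider that handles it. */
    static class ProviderRegistry {
        private final Map<String, Provider> providers = new HashMap<>();

        void register(String targetApplication, Provider provider) {
            providers.put(normalize(targetApplication), provider);
        }

        Optional<Provider> resolve(String targetApplication) {
            return Optional.ofNullable(providers.get(normalize(targetApplication)));
        }

        // Case-insensitive lookup so "Hadoop" and "hadoop" select the same provider.
        private static String normalize(String targetApplication) {
            return targetApplication.toLowerCase(Locale.ROOT);
        }
    }

    public static void main(String[] args) {
        ProviderRegistry registry = new ProviderRegistry();
        registry.register("Hadoop", new HadoopProvider());

        // The scheduler would read these values from the application context.
        String result = registry.resolve("hadoop")
                .map(p -> p.execute("wordcount-1", "/data/in", "/data/out"))
                .orElse("no provider registered for this target application");
        System.out.println(result);
    }
}
```

With a registry like this, supporting Sector/Sphere or Hyracks would in
principle mean registering one more provider under a new target application
type, which is the kind of plug-in behaviour discussed above.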
