Re: Airavata/Hadoop Integration

Amila Jayasekara Mon, 25 Feb 2013 19:17:00 -0800

On Mon, Feb 25, 2013 at 9:59 PM, Danushka Menikkumbura
<[email protected]> wrote:
> Also, I suggest we have a simple plug-in architecture for providers that
> would make having custom providers possible.


Hi Danushka,

I believe the plugin mechanism for providers is already in place with
the new GFac architecture. Lahiru will be able to give more information
about this.

Thanks
Amila

>
> Thanks,
> Danushka
>
>
> On Tue, Feb 26, 2013 at 3:18 AM, Danushka Menikkumbura <
> [email protected]> wrote:
>
>> Hi Devs,
>>
>> I am looking into extending Big Data capabilities of Airavata as my M.Sc.
>> research work. I have identified certain possibilities and am going to
>> start with integrating Apache Hadoop (and Hadoop-like frameworks) with
>> Airavata.
>>
>> According to what I have understood, the best approach would be to have a
>> new GFacProvider for Hadoop that takes care of handling Hadoop jobs. We can
>> have a new parameter in the ApplicationContext (say TargetApplication) to
>> define the target application type and resolve the correct provider in the
>> GFac Scheduler based on that. I see that having this capability in the
>> Scheduler class is already a TODO. I have been able to make these changes
>> locally and invoke a simple Hadoop job using GFac. So I can confirm that
>> this approach is viable, barring any implications that I am missing.
>>
>> I think we can store Hadoop job definitions in the Airavata Registry where
>> each definition would essentially include a unique identifier and other
>> attributes like mapper, reducer, sorter, formatters, etc. that can be defined
>> using XBaya. Information about these building blocks could be loaded from
>> XML metadata files (of a known format) included in jar files. It should
>> also be possible to compose Hadoop job "chains" using XBaya. So, what we
>> specify in the application context would be the target application type
>> (say Hadoop), job/chain id, input file location and the output file
>> location. In addition, I am thinking of having job monitoring support based
>> on constructs provided by the Hadoop API (that I have already looked into)
>> and data querying based on Apache Hive/Pig.
>>
>> Furthermore, apart from Hadoop there are two other similar frameworks that
>> look quite promising.
>>
>> 1. Sector/Sphere
>>
>> Sector/Sphere [1] is an open source software framework for
>> high-performance distributed data storage and processing. It is comparable
>> with Apache HDFS/Hadoop. Sector is a distributed file system and Sphere is
>> the programming framework that supports massive in-storage parallel data
>> processing on data stored in Sector. The key motivation is that Sector/Sphere
>> is claimed to be about 2-4 times faster than Hadoop.
>>
>> 2. Hyracks
>>
>> Hyracks [2] is another framework for data-intensive computing that is
>> roughly in the same space as Apache Hadoop. It has support for composing
>> and executing native Hyracks jobs plus running Hadoop jobs in the Hyracks
>> runtime. Furthermore, it powers the popular parallel DBMS, ASTERIX [3].
>>
>> I am yet to look into the APIs of these two frameworks but they should
>> ideally work with the same GFac implementation that I have proposed for
>> Hadoop.
>>
>> I would greatly appreciate your feedback on this approach. I would also
>> welcome pros and cons of using Sector/Sphere or Hyracks if you already have
>> any experience with them.
>>
>> [1] Y. Gu and R. L. Grossman, “Lessons learned from a year’s worth of
>> benchmarks of large data clouds,” in Proceedings of the 2nd Workshop on
>> Many-Task Computing on Grids and Supercomputers, 2009, p. 3.
>>
>> [2] V. Borkar, M. Carey, R. Grover, N. Onose, and R. Vernica, “Hyracks: A
>> flexible and extensible foundation for data-intensive computing,” in Data
>> Engineering (ICDE), 2011 IEEE 27th International Conference on, 2011, pp.
>> 1151–1162.
>>
>> [3] http://asterix.ics.uci.edu/
>>
>> Thanks,
>> Danushka
>>
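
For illustration only, here is a minimal sketch of the provider resolution
described in the original proposal: a scheduler picks a provider based on a
target application type carried in the application context. All names here
(Provider, ApplicationContext, Scheduler, and their fields) are hypothetical
stand-ins and are not the actual Airavata GFac classes or signatures. The
"Hadoop" provider is a stub that only formats a string and does not submit a
real job.

```java
import java.util.HashMap;
import java.util.Map;

public class ProviderResolutionSketch {

    /** Hypothetical stand-in for a GFac provider; not the real Airavata interface. */
    interface Provider {
        String execute(String jobId, String inputLocation, String outputLocation);
    }

    /** Hypothetical context carrying the four values named in the proposal. */
    static final class ApplicationContext {
        final String targetApplication; // e.g. "Hadoop"
        final String jobId;             // job or chain identifier
        final String inputLocation;
        final String outputLocation;

        ApplicationContext(String targetApplication, String jobId,
                           String inputLocation, String outputLocation) {
            this.targetApplication = targetApplication;
            this.jobId = jobId;
            this.inputLocation = inputLocation;
            this.outputLocation = outputLocation;
        }
    }

    /** Hypothetical scheduler that resolves a provider by target application type. */
    static final class Scheduler {
        private final Map<String, Provider> providers = new HashMap<>();

        /** Registers a provider under a target application type. */
        void register(String targetApplication, Provider provider) {
            providers.put(targetApplication, provider);
        }

        /** Returns the provider for the context's target type, or fails loudly. */
        Provider resolve(ApplicationContext context) {
            Provider provider = providers.get(context.targetApplication);
            if (provider == null) {
                throw new IllegalArgumentException(
                        "No provider registered for " + context.targetApplication);
            }
            return provider;
        }
    }

    public static void main(String[] args) {
        Scheduler scheduler = new Scheduler();
        // Stub provider: a real one would submit the job through the Hadoop API.
        scheduler.register("Hadoop",
                (jobId, in, out) -> "hadoop job " + jobId + ": " + in + " -> " + out);

        ApplicationContext context =
                new ApplicationContext("Hadoop", "wordcount-1", "/data/in", "/data/out");
        Provider provider = scheduler.resolve(context);
        System.out.println(provider.execute(
                context.jobId, context.inputLocation, context.outputLocation));
    }
}
```

Keying the lookup on a plain string keeps the sketch small. Under the same
assumption, providers for Sector/Sphere or Hyracks could be registered under
their own keys without changing the resolution logic.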
