Hi All,

It's been announced that I've passed the midterm evaluations! Besides my mentors, Lewis and Talat, I await your comments and suggestions about my project during the second half of GSoC. Thank you all again!
Kind Regards,
Furkan KAMACI

On 1 Jul 2015 10:35, "Lewis John Mcgibbney" <[email protected]> wrote:

> This is fantastic.
> Needless to say, the project will be progressing through the midterm.
> Your blogging is very positive for the dissemination of your work.
> I'd also like to extend a personal thank you to Talat. Excellent job, and on
> behalf of the community here, an excellent effort to drive this GSoC
> project so far only halfway through :).
> Looking forward to committing the initial patches into the master branch and
> also your LogManagerSpark, which will lower the barrier to adopting the
> module.
> Thanks
> Lewis
>
> On Wednesday, July 1, 2015, Furkan KAMACI <[email protected]> wrote:
>
>> Hi,
>>
>> First of all, I would like to thank you all. As you know, I was accepted
>> to GSoC 2015 with my proposal to develop Spark backend support for Gora
>> (GORA-386), and it is now time for midterm evaluations. I want to share
>> the current progress of my project and my midterm report as well.
>>
>> During my GSoC period, I've blogged at my personal website
>> (http://furkankamaci.com/) and created a fork of Apache Gora's master
>> branch to work on: https://github.com/kamaci/gora
>>
>> During the community bonding period, I read the Apache Gora documentation
>> and source code to become more familiar with the project. I analyzed
>> related projects, including Apache Flink and Apache Crunch, to inform
>> implementing a Spark backend for Apache Gora. I also picked up an issue
>> from Jira (https://issues.apache.org/jira/browse/GORA-262) and fixed it.
>>
>> In the coding period, since implementing this project requires an
>> understanding of Apache Spark's infrastructure, I started by analyzing
>> Spark's first papers.
>> I analyzed "Spark: Cluster Computing with Working Sets"
>> (http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf) and
>> "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for
>> In-Memory Cluster Computing"
>> (https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf). I
>> published two posts at my personal blog, on Spark and Cluster Computing
>> (http://furkankamaci.com/spark-and-cluster-computing/) and Resilient
>> Distributed Datasets
>> (http://furkankamaci.com/resilient-distributed-datasets-rdds/). I
>> followed the Apache Spark documentation and developed examples to
>> analyze RDDs.
>>
>> I analyzed Apache Gora's GoraInputFormat class and Spark's
>> newAPIHadoopRDD method, and implemented an example application that
>> reads data from HBase.
>>
>> Apache Gora supports reading/writing data from/to Hadoop files, and
>> Spark has a method for generating an RDD compatible with Hadoop files.
>> So the architecture is designed to create a bridge between
>> GoraInputFormat and RDDs, since both support Hadoop files.
>>
>> I created a base class for the Apache Gora and Spark integration, named
>> GoraSparkEngine. It has initialize methods that take a Spark context, a
>> data store, and an optional Hadoop configuration, and return an RDD.
>>
>> After implementing a base for the GoraSpark engine, I developed a new
>> example aligned with LogAnalytics, named LogAnalyticsSpark. I developed
>> the map and reduce parts (except for writing results into the database),
>> which do the same thing as LogAnalytics plus something more, i.e.
>> printing the number of lines in tables.
>>
>> Once we get an RDD from the GoraSpark engine, we can operate on it just
>> as on any other RDD that was not created via Apache Gora. The whole code
>> can be checked at the code base: https://github.com/kamaci/gora
>>
>> Project progress is ahead of the proposed timeline so far.
>> The GoraInputFormat-to-RDD transformation is done, and it has been shown
>> that map, reduce, and other methods work properly on that kind of RDD.
>>
>> Before the next steps, I am planning to design the overall architecture
>> according to feedback from the community (there are some prerequisites
>> when designing the architecture, e.g. the configuration of a Spark
>> context cannot be changed after the context has been initialized).
>>
>> Once the necessary functionality is implemented, examples, tests, and
>> documentation will follow. After that, if I have extra time, I'm
>> planning to run a performance benchmark comparing Apache Gora with
>> Hadoop MapReduce, plain Hadoop MapReduce, plain Apache Spark, and Apache
>> Gora with Spark.
>>
>> Special thanks to Lewis and Talat. I should also mention that it is a
>> real chance to be able to talk with your mentor face to face. Talat and
>> I met many times, and he helped me a lot with how Hadoop and Apache Gora
>> work.
>>
>> PS: I've attached my midterm report; my previous reports can be found
>> here:
>>
>> https://cwiki.apache.org/confluence/display/GORA/Spark+Backend+Support+for+Gora+%28GORA-386%29+Reports
>>
>> Kind Regards,
>> Furkan KAMACI
>
>
> --
> *Lewis*
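For readers following the thread, the map/reduce shape that LogAnalyticsSpark applies to a Gora-backed RDD (extract a key from each record, then sum counts per key, as Spark's reduceByKey does) can be sketched in plain Python with no Spark dependency. This is only an illustrative sketch: the log line format, field positions, and variable names here are assumptions for the example, not Gora's actual schema or API.

```python
from collections import defaultdict

# Hypothetical log records: "timestamp url" pairs (illustrative data only).
log_lines = [
    "2015-07-01T10:00 /index.html",
    "2015-07-01T10:05 /about.html",
    "2015-07-01T10:07 /index.html",
]

# Map step: emit a (url, 1) pair for every log record.
pairs = [(line.split()[1], 1) for line in log_lines]

# Reduce step: sum the counts per url, mirroring what reduceByKey
# would do on an RDD obtained from the GoraSpark engine.
counts = defaultdict(int)
for url, n in pairs:
    counts[url] += n

print(dict(counts))  # -> {'/index.html': 2, '/about.html': 1}
```

On a real RDD returned by GoraSparkEngine, the same two steps would likely be expressed with Spark's mapToPair followed by reduceByKey, with the values coming from the Gora data store rather than an in-memory list.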

