Fwd: [stratosphere-dev] Re: Project for GSoC

Robert Metzger Tue, 26 Aug 2014 02:30:25 -0700

Hi Anirvan,

I'm forwarding this message to [email protected]. You need to
send an (empty) message to [email protected] to
subscribe to the dev list.
The dev@ list is for discussions with the developers, planning etc. The
[email protected] list is for user questions (for example
troubles using the API, conceptual questions etc.)
I think the message below is more suited for the dev@ list, since it's
basically a feature request.


Regarding the names: We don't use Stratosphere anymore. Our codebase has
been renamed to Flink and moved to the "org.apache.flink" namespace, so
hopefully this confusion is finally resolved.

For those who want to have a look into the history of the message, see the
Google Groups archive here:
https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ

---------- Forwarded message ----------
From: Nirvanesque <[email protected]>
Date: Tue, Aug 26, 2014 at 11:12 AM
Subject: [stratosphere-dev] Re: Project for GSoC
To: [email protected]
Cc: [email protected]


Hello Artem and mentors,

First of all nice greetings from INRIA, France.
Hope you had an enjoyable experience in GSOC!
Thanks to Robert (rmetzger) for forwarding me here ...

At INRIA, we are starting to adopt Stratosphere / Flink.
The top-level goal is to enhance performance in User Defined Functions
(UDFs) with long workflows using multiple M-R, by using the larger set of
Second Order Functions (SOFs) in Stratosphere / Flink.
We will demonstrate this improvement by implementing some Use Cases for
business purposes.
For this purpose, we have chosen some customer analysis Use Cases using
weblogs and related data, for 2 companies (which appeared interested in
trying Stratosphere / Flink):
- a mobile phone app developer: http://www.tribeflame.com
- an anti-virus & Internet security software company: www.f-secure.com
I will be happy to share with you these Use Cases, if you are interested.
Just ask me here.

At present, we are typically in the profiles of Alice-Bob-Sam, as described
in your GSoC proposal
<https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis>.
:-)
Hadoop seems to be the starting square for the Stratosphere / Flink journey.
The situation is the same for the developers at the above 2 companies :-)

Briefly,
We have installed and run some example programmes from Flink / Stratosphere
(versions 0.5.2 and 0.6). We use a cluster (the grid5000) for our Hadoop &
Stratosphere installations.
We have a good understanding of Hadoop and its use in Streaming and
Pipes in conjunction with scripting languages (Python & R specifically).
In the first phase, we would like to run some "Hadoop-like" jobs (mainly
multiple M-R workflows) on Stratosphere, preferably with extensive Java or
Scala programming.
I refer to your GSoC project map
<https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29>
which seems very interesting.
If we could have a Hadoop abstraction as you have mentioned, that would be
ideal for our first phase.
In later phases, when we implement complex join and group operations, we
would dive deeper into the Stratosphere / Flink Java or Scala APIs.

Hence, I would like to know the current status in this direction:
What has been implemented already, and from which version onwards? How can
we try it?
What is yet to be implemented, and in which versions is it planned?

You may also like to see my discussion with Robert on this page
<http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261>.

I am still digging through different discussions - here as well as on JIRA.
Please do refer me to the relevant links, JIRA tickets, etc. if that saves
you time retyping large replies.
It will also help us to understand the train of collective thinking in the
Stratosphere / Flink roadmap.

Thanks in advance,
Anirvan
PS: Apologies for mixing old and new names (e.g. Flink / Stratosphere),
as I am not sure exactly which name to use currently.


On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis wrote:
>
> Hello Fabian,
>
> On Tuesday, February 25, 2014 11:20:10 AM UTC+2, [email protected] wrote:
> > Hi Artem,
> >
> > thanks a lot for your interest in Stratosphere and participating in our
> GSoC projects!
> >
> > As you know, Hadoop is the big elephant out there in the Big Data jungle
> and widely adopted. Therefore, a Hadoop compatibility layer is a very
> important feature for any large scale data processing system.
> > Stratosphere builds on foundations of MapReduce but generalizes its
> concepts and provides a more efficient runtime.
>
> Great!
>
> > When you have a look at the Stratosphere WordCount example program, you
> will see, that the programming principles of Stratosphere and Hadoop
> MapReduce are quite similar, although Stratosphere is not compatible with
> the Hadoop interfaces.
>
> Yes, I've looked into the examples (WordCount, k-means). I also ran the big
> test job you have locally and it seems to be OK.
>
> > With the proposed project we want to achieve, that Hadoop MapReduce jobs
> can be executed on Stratosphere without changing a line of code (if
> possible).
> >
> > We have already some pieces for that in place. InputFormats are done
> (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility),
> OutputFormats are
> work in progress. The biggest missing piece is executing Hadoop Map and
> Reduce tasks in Stratosphere. Hadoop provides quite a few interfaces (e.g.,
> overwriting partitioning function and sorting comparators, counters,
> distributed cache, ...). It would of course be desirable to support as many
> of these interfaces as possible, but they can be added step-by-step once
> the first Hadoop jobs are running on Stratosphere.
>
> So if I understand correctly, the idea is to create logical wrappers for
> all interfaces used by Hadoop jobs (the way it has been done with the
> Hadoop data types) so they can be run as transparently as possible
> on Stratosphere in an efficient way. I agree, there are many interfaces,
> but it's very interesting considering the way Stratosphere defines tasks,
> which is a bit different (though, as you said, the principle is similar).
>
> I assume the focus is on the YARN version of Hadoop (new API)?
>
> And one last question: serialization for Stratosphere is Java's default
> mechanism, right?
>
> >
> > Regarding your question about cloud deployment scripts, one of our team
> members is currently working on this (see this thread:
> https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > I am not sure, if this is still in the making or already done. If you
> are interested in this as well, just drop a line to the thread. Although I
> am not very familiar with the details of this, my gut feeling is that this
> would be a bit too little for an individual project. However, there might be
> ways to extend this. So if you have any ideas, share them with us and we
> will be happy to discuss them.
>
> Thank you for pointing out the topic. I will let you know if I come up with
> anything for this. Probably after I try deploying it on OpenStack.
>
> >
> > Again, thanks a lot for your interest and please don't hesitate to ask
> questions. :-)
>
> Thank you for the helpful answers.
>
> Kind regards,
> Artem
>
>
> >
> > Best,
> > Fabian
> >
> >
> > On Tuesday, February 25, 2014 9:12:10 AM UTC+1, [email protected]
> wrote:
> > Dear Stratosphere devs and fellow GSoC potential students,
> > Hello!
> > I'm Artem, an undergraduate student from Athens, Greece. You can find me
> on GitHub (https://github.com/atsikiridis) and occasionally on
> Stack Overflow (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> Currently, however, I'm in Switzerland where I am doing my internship at
> CERN as a back-end software developer for INSPIRE, a library for High Energy
> Physics (we're running on http://inspirehep.net/). The service is in
> Python (based on the open-source project http://invenio.net) and my
> responsibilities are mostly the integration with Redis, database
> abstractions, testing (unit, regression) and helping
> > our team to integrate modern technologies and frameworks into the current
> code base.
> > Moreover, I am very interested in big data technologies, therefore
> before coming to CERN I had been trying to make my first steps in research
> at the Big Data lab of AUEB, my home university. The main objective
> of the project I had been involved with was the implementation of a
> dynamic caching mechanism for Hadoop (in a way trying our cache instead of
> the built-in distributed cache). Other technologies involved were Redis,
> Memcached, Ehcache (Terracotta). With this project we gained some insights
> about the internals of Hadoop (new API, old API, how tasks work, Hadoop
> serialization, the daemons running etc.) and HDFS, and deployed clusters on
> cloud computing platforms (OpenStack with Nova, Amazon EC2 with boto). We
> also used the Java Remote API for some tests.
> > Unfortunately, I have not used Stratosphere before in a research/production
> environment. I have only played with the examples on my local machine. It
> is very interesting and I would love to learn more.
> > There will probably be a learning curve for me on the Stratosphere side
> but implementing a Hadoop Compatibility Layer seems like a very interesting
> project and I believe I can be of use :)
> > Finally, I was wondering whether there are some command-line tools for
> deploying Stratosphere automatically for EC2 or OpenStack clouds (for
> example, Stratosphere-specific abstractions on top of the Python boto API). Do
> you think that would make sense as a project?
> > Pardon me for the length of this.
> > Kind regards,
> > Artem

