Hi Anirvan, welcome to the Flink mailing list.
Your examples are based on Hadoop Streaming. I have to admit I am not an
expert in Hadoop Streaming, but from my understanding it is basically a
wrapper to run UDFs implemented in a programming language other than Java.
Flink does not support Hadoop Streaming at the moment. I don't know how it
works under the hood, but it may be that Hadoop Streaming works once we
have support for executing regular Hadoop jobs. However, I would not bet
on or wait for that. There are efforts to add a Python interface for
Flink, but this is still work in progress.

I'd recommend getting started with the regular Java API. For that, I point
you to the Java Quickstart on our website:
--> http://flink.incubator.apache.org/docs/0.6-incubating/java_api_quickstart.html

Let us know if you have further questions.

Best,
Fabian

2014-08-27 16:48 GMT+02:00 Anirvan BASU <[email protected]>:

> Hello Artem, Fabian and Robert,
>
> Thank you for your valuable inputs.
>
> Here is what we are trying to do currently (scratching the surface):
>
> Consider typical Hadoop commands of the form:
>
> hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
>   -D stream.num.map.output.key.fields=2 \
>   -D mapred.text.key.partitioner.options=-k1,1 \
>   -inputformat org.apache.hadoop.mapred.TextInputFormat \
>   -input stocks/input/stocks.txt \
>   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
>
> Here we run a Hadoop job with some parameters and keys specified.
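As background on the two -D options in the command above: Hadoop Streaming treats each mapper output line as tab-separated fields; stream.num.map.output.key.fields=2 makes the first two fields the key, and mapred.text.key.partitioner.options=-k1,1 tells KeyFieldBasedPartitioner to hash only the first field. A rough sketch of that behavior in plain Python (not Hadoop code; the stock record and reducer count are made up for illustration):

```python
NUM_KEY_FIELDS = 2   # stream.num.map.output.key.fields
NUM_REDUCERS = 4     # illustrative stand-in for mapred.reduce.tasks

def split_key_value(line, num_key_fields=NUM_KEY_FIELDS):
    """Split a tab-separated mapper output line into (key fields, value fields)."""
    fields = line.rstrip("\n").split("\t")
    return fields[:num_key_fields], fields[num_key_fields:]

def partition(key_fields, num_reducers=NUM_REDUCERS):
    """KeyFieldBasedPartitioner with -k1,1: hash only the first key field."""
    return hash(key_fields[0]) % num_reducers

# A hypothetical stock record: symbol and date form the composite key,
# but only the symbol decides which reducer receives the record.
key, value = split_key_value("AAPL\t2014-08-27\t100.5")
```

The effect is that all records sharing the first field land on the same reducer, while the second key field still takes part in the sort of that reducer's input.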
> hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
>   -input /ngrams/input -output /ngrams/output \
>   -mapper ./python/mapper.py -file ./python/mapper.py \
>   -combiner ./python/reducer.py \
>   -reducer ./python/reducer.py -file ./python/reducer.py \
>   -jobconf stream.num.map.output.key.fields=3 \
>   -jobconf stream.num.reduce.output.key.fields=3 \
>   -jobconf mapred.reduce.tasks=10
>
> Here we run a Hadoop job with some parameters specified and with external
> scripts for the mapper, combiner, and reducer (in Python - since you have
> developed a Python connector, I cite this example).
>
> The above are slightly more complicated than "hello world"-type wordcount
> examples, ain't it?
>
> Hence, we would like to know how to wrap and run the above jobs (if
> possible) using Flink.
> In which level of compatibility would you classify the above jobs?
>
> If possible, could you help us by:
> - showing how to "wrap" around such commands?
> - listing which dependencies, sources, and classes are needed (in the
>   pom.xml if required to build a package, or as imports if required to
>   build a jar file)?
>
> Honestly, after drowning in all the information, we are trying to
> understand the (Flink) elephant as blind mice. We understand it has legs,
> a trunk, etc., but don't understand where to start to mount and ride it.
>
> Happy to clarify again and again until we reach success.
>
> Thank you for your understanding and all your kind help,
> Anirvan
>
> ----- Original Message -----
> > From: "Fabian Hueske" <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, August 27, 2014 10:04:18 AM
> > Subject: Re: [stratosphere-dev] Re: Project for GSoC
> >
> > Hi Anirvan,
> >
> > we have a JIRA issue that tracks the HadoopCompatibility feature:
> > https://issues.apache.org/jira/browse/FLINK-838
> > The basic mechanism is done, but it has not been merged because we
> > found a cleaner way to integrate the feature.
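The ./python/mapper.py and ./python/reducer.py scripts referenced in the command above are not included in the thread. As a hypothetical sketch of what such a Hadoop Streaming pair typically looks like (a simple token count standing in for the real n-gram logic; in the actual scripts the functions would read sys.stdin and print to stdout):

```python
from itertools import groupby

def mapper(lines):
    """Emit 'token<TAB>1' per whitespace token (stands in for real n-gram logic)."""
    for line in lines:
        for token in line.split():
            yield "%s\t1" % token

def reducer(lines):
    """Sum counts per key; Hadoop Streaming delivers reducer input sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(kv[1]) for kv in group))

# In-memory stand-in for: cat input | mapper.py | sort | reducer.py
mapped = sorted(mapper(["to be or not to be"]))
counts = list(reducer(mapped))
# counts == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

The point of the streaming contract is only stdin/stdout plus the sorted-by-key guarantee between the two stages, which is why the same reducer script can also serve as the -combiner.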
> >
> > There are different levels of Hadoop compatibility, and I did not fully
> > understand which kind of compatibility you need.
> >
> > Conceptual Compatibility:
> > Flink already offers UDF interfaces which are very similar to Hadoop's
> > Map, Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce).
> > These interfaces are not source-code compatible, but porting the code
> > should be trivial. This is already there. If you're fine with porting
> > your code from Hadoop to Flink (which should be a very small effort),
> > you're good to go.
> >
> > Function-Level Source Compatibility:
> > Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> > programs is not difficult (since they are very similar to their Flink
> > counterparts). This is not yet included in the codebase but could be
> > added at rather short notice. We do already have interface
> > compatibility for Hadoop Input- and OutputFormats, so you can use these
> > without changing the code.
> >
> > Job-Level Source Compatibility:
> > This is what we aimed for with the GSoC project. Here we want to
> > support the execution of a full Hadoop MR job in Flink without changing
> > the code. However, this turned out to be a bit tricky if custom
> > partitioners, sort comparators, and grouping comparators are used in
> > the job. Adding this feature will take a bit more time.
> >
> > Function- and job-level source compatibility will enable the reuse of
> > already existing Hadoop code. If you are implementing new analysis jobs
> > anyway, I'd go for a Flink implementation, which eases many things such
> > as secondary sort, unions of multiple inputs, etc.
> >
> > Cheers,
> > Fabian
> >
> > 2014-08-26 11:29 GMT+02:00 Robert Metzger <[email protected]>:
> >
> > > Hi Anirvan,
> > >
> > > I'm forwarding this message to [email protected]. You
> > > need to send an (empty) message to
> > > [email protected] to subscribe to the dev list.
> > > The dev@ list is for discussions with the developers, planning, etc.
> > > The [email protected] list is for user questions (for
> > > example trouble using the API, conceptual questions, etc.).
> > > I think the message below is better suited for the dev@ list, since
> > > it's basically a feature request.
> > >
> > > Regarding the names: we don't use Stratosphere anymore. Our codebase
> > > has been renamed to Flink and the "org.apache.flink" namespace. So
> > > ideally this confusion is finally out of the world.
> > >
> > > For those who want to have a look into the history of the message,
> > > see the Google Groups archive here:
> > > https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ
> > >
> > > ---------- Forwarded message ----------
> > > From: Nirvanesque <[email protected]>
> > > Date: Tue, Aug 26, 2014 at 11:12 AM
> > > Subject: [stratosphere-dev] Re: Project for GSoC
> > > To: [email protected]
> > > Cc: [email protected]
> > >
> > > Hello Artem and mentors,
> > >
> > > First of all, nice greetings from INRIA, France.
> > > Hope you had an enjoyable experience in GSoC!
> > > Thanks to Robert (rmetzger) for forwarding me here.
> > >
> > > At INRIA, we are starting to adopt Stratosphere / Flink.
> > > The top-level goal is to enhance the performance of User Defined
> > > Functions (UDFs) in long workflows with multiple M-R steps, by using
> > > the larger set of Second Order Functions (SOFs) in Stratosphere /
> > > Flink. We will demonstrate this improvement by implementing some use
> > > cases for business purposes.
> > > For this purpose, we have chosen some customer-analysis use cases
> > > based on weblogs and related data, for two companies (who appeared
> > > interested to try using Stratosphere / Flink):
> > > - a mobile phone app developer: http://www.tribeflame.com
> > > - an anti-virus & Internet security software company: www.f-secure.com
> > > I will be happy to share these use cases with you if you are
> > > interested. Just ask me here.
> > >
> > > At present, we are typically in the profiles of Alice-Bob-Sam, as
> > > described in your GSoC proposal
> > > <https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis>.
> > > :-)
> > > Hadoop seems to be the starting square for the Stratosphere / Flink
> > > journey. The same is the situation with the developers in the above
> > > two companies. :-)
> > >
> > > Briefly:
> > > We have installed and run some example programs from Flink /
> > > Stratosphere (versions 0.5.2 and 0.6). We use a cluster (Grid'5000)
> > > for our Hadoop & Stratosphere installations.
> > > We have a good understanding of Hadoop and its use with Streaming and
> > > Pipes in conjunction with scripting languages (Python & R
> > > specifically).
> > > In the first phase, we would like to run some "Hadoop-like" jobs
> > > (mainly multiple M-R workflows) on Stratosphere, preferably without
> > > extensive Java or Scala programming.
> > > I refer to your GSoC project map
> > > <https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29>,
> > > which seems very interesting.
> > > If we could have a Hadoop abstraction as you have mentioned, that
> > > would be ideal for our first phase.
> > > In later phases, when we implement complex join and group operations,
> > > we would dive deeper into the Stratosphere / Flink Java or Scala
> > > APIs.
> > >
> > > Hence, I would like to know: what is the current status in this
> > > direction? What has been implemented already? From which version
> > > onwards? How can we try it?
> > > What is yet to be implemented? When, and in which versions?
> > >
> > > You may also like to see my discussion with Robert on this page
> > > <http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261>.
> > >
> > > I am still mining different discussions - here as well as on JIRA.
> > > Please do refer me to the relevant links, JIRA tickets, etc. if that
> > > saves you the time of re-typing long replies.
> > > It will also help us understand the train of collective thinking in
> > > the Stratosphere / Flink roadmap.
> > >
> > > Thanks in advance,
> > > Anirvan
> > >
> > > PS: Apologies for mixing names (e.g. Flink / Stratosphere), as I am
> > > not sure which name exactly to use currently.
> > >
> > > On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis
> > > wrote:
> > > >
> > > > Hello Fabian,
> > > >
> > > > On Tuesday, February 25, 2014 11:20:10 AM UTC+2, [email protected]
> > > > wrote:
> > > > > Hi Artem,
> > > > >
> > > > > thanks a lot for your interest in Stratosphere and for
> > > > > participating in our GSoC projects!
> > > > >
> > > > > As you know, Hadoop is the big elephant out there in the Big Data
> > > > > jungle and is widely adopted. Therefore, a Hadoop compatibility
> > > > > layer is a very important feature for any large-scale data
> > > > > processing system.
> > > > > Stratosphere builds on the foundations of MapReduce but
> > > > > generalizes its concepts and provides a more efficient runtime.
> > > >
> > > > Great!
> > > >
> > > > > When you have a look at the Stratosphere WordCount example
> > > > > program, you will see that the programming principles of
> > > > > Stratosphere and Hadoop MapReduce are quite similar, although
> > > > > Stratosphere is not compatible with the Hadoop interfaces.
> > > >
> > > > Yes, I've looked into the examples (WordCount, k-means). I also ran
> > > > the big test job you have locally, and it seems to be OK.
> > > >
> > > > > With the proposed project we want to achieve that Hadoop
> > > > > MapReduce jobs can be executed on Stratosphere without changing a
> > > > > line of code (if possible).
> > > > >
> > > > > We already have some pieces for that in place. InputFormats are
> > > > > done (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility);
> > > > > OutputFormats are work in progress. The biggest missing piece is
> > > > > executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop
> > > > > provides quite a few interfaces (e.g., overriding the
> > > > > partitioning function and sorting comparators, counters, the
> > > > > distributed cache, ...). It would of course be desirable to
> > > > > support as many of these interfaces as possible, but they can be
> > > > > added step by step once the first Hadoop jobs are running on
> > > > > Stratosphere.
> > > >
> > > > So if I understand correctly, the idea is to create logical
> > > > wrappers for all interfaces used by Hadoop jobs (the way it has
> > > > been done with the Hadoop datatypes) so that a job can run as
> > > > transparently as possible on Stratosphere in an efficient way. I
> > > > agree, there are many interfaces, but it's very interesting
> > > > considering the way Stratosphere defines tasks, which is a bit
> > > > different (though, as you said, the principle is similar).
> > > >
> > > > I assume the focus is on the YARN version of Hadoop (the new API)?
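The "logical wrappers" idea discussed above, adapting a Hadoop-style UDF so it can be called through a FlatMap-style interface, can be sketched in miniature. The snippet below is plain Python for illustration, not the actual Hadoop or Stratosphere API: a Hadoop-style map takes (key, value) plus a collector, while a FlatMap-style function takes one record plus a collector, so a wrapper mostly just unpacks the record:

```python
# Conceptual sketch only: signatures are simplified stand-ins for the
# real Hadoop Mapper and Stratosphere/Flink FlatMapFunction interfaces.

def hadoop_wordcount_map(key, value, collect):
    """Hadoop-style map: (k1, v1) in, (k2, v2) pairs out via a collector."""
    for word in value.split():
        collect((word, 1))

def wrap_hadoop_map(hadoop_map):
    """Return a FlatMap-style function: one record in, collector out."""
    def flat_map(record, collect):
        key, value = record  # assume records are (key, value) pairs
        hadoop_map(key, value, collect)
    return flat_map

flat_map = wrap_hadoop_map(hadoop_wordcount_map)
out = []
flat_map((0, "to be or not to be"), out.append)
# out == [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

In the real wrappers, the extra Hadoop machinery (JobConf, Reporter, OutputCollector, counters) is what makes this more involved than the sketch suggests.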
> > > >
> > > > And one last question: serialization in Stratosphere is Java's
> > > > default mechanism, right?
> > > >
> > > > > Regarding your question about cloud deployment scripts, one of
> > > > > our team members is currently working on this (see this thread:
> > > > > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > > > > I am not sure if this is still in the making or already done. If
> > > > > you are interested in this as well, just drop a line to the
> > > > > thread. Although I am not very familiar with the details of this,
> > > > > my gut feeling is that this would be a bit too little for an
> > > > > individual project. However, there might be ways to extend this.
> > > > > So if you have any ideas, share them with us and we will be happy
> > > > > to discuss them.
> > > >
> > > > Thank you for bringing up the topic. I will let you know if I come
> > > > up with anything for this. Probably after I try deploying it on
> > > > OpenStack.
> > > >
> > > > > Again, thanks a lot for your interest, and please don't hesitate
> > > > > to ask questions. :-)
> > > >
> > > > Thank you for the helpful answers.
> > > >
> > > > Kind regards,
> > > > Artem
> > > >
> > > > > Best,
> > > > > Fabian
> > > > >
> > > > > On Tuesday, February 25, 2014 9:12:10 AM UTC+1,
> > > > > [email protected] wrote:
> > > > > Dear Stratosphere devs and fellow GSoC potential students,
> > > > > Hello!
> > > > > I'm Artem, an undergraduate student from Athens, Greece. You can
> > > > > find me on GitHub (https://github.com/atsikiridis) and
> > > > > occasionally on Stack Overflow
> > > > > (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > > > > Currently, however, I'm in Switzerland, where I am doing an
> > > > > internship at CERN as a back-end software developer for INSPIRE,
> > > > > a library for High Energy Physics (we're running on
> > > > > http://inspirehep.net/).
> > > > > The service is in Python (based on the open-source project
> > > > > http://invenio.net), and my responsibilities are mostly the
> > > > > integration with Redis, database abstractions, testing (unit,
> > > > > regression), and helping our team to integrate modern
> > > > > technologies and frameworks into the current code base.
> > > > > Moreover, I am very interested in big data technologies;
> > > > > therefore, before coming to CERN I had been trying to make my
> > > > > first steps in research at the Big Data lab of AUEB, my home
> > > > > university. The main objective of the project I was involved with
> > > > > was the implementation of a dynamic caching mechanism for Hadoop
> > > > > (in a way, trying our cache instead of the built-in distributed
> > > > > cache). Other technologies involved were Redis, Memcached, and
> > > > > Ehcache (Terracotta). With this project we gained some insights
> > > > > into the internals of Hadoop (new API, old API, how tasks work,
> > > > > Hadoop serialization, the running daemons, etc.) and HDFS, and we
> > > > > deployed clusters on cloud computing platforms (OpenStack with
> > > > > Nova, Amazon EC2 with boto). We also used the Java Remote API for
> > > > > some tests.
> > > > > Unfortunately, I have not used Stratosphere before in a
> > > > > research/production environment. I have only played with the
> > > > > examples on my local machine. It is very interesting and I would
> > > > > love to learn more.
> > > > > There will probably be a learning curve for me on the
> > > > > Stratosphere side, but implementing a Hadoop compatibility layer
> > > > > seems like a very interesting project and I believe I can be of
> > > > > use :)
> > > > > Finally, I was wondering whether there are some command-line
> > > > > tools for deploying Stratosphere automatically on EC2 or
> > > > > OpenStack clouds (for example, Stratosphere-specific abstractions
> > > > > on top of the Python boto API). Do you think that would make
> > > > > sense as a project?
> > > > > Pardon me for the length of this.
> > > > > Kind regards,
> > > > > Artem
> > >
> > > --
> > > You received this message because you are subscribed to the Google
> > > Groups "stratosphere-dev" group.
> > > To unsubscribe from this group and stop receiving emails from it,
> > > send an email to [email protected].
> > > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > > For more options, visit https://groups.google.com/d/optout.
