Hi Anirvan, welcome to the Flink mailing list.
Your examples are based on Hadoop Streaming. I have to admit I am not an
expert in Hadoop Streaming, but from my understanding it is basically a
wrapper to run UDFs implemented in a programming language other than Java.
Flink does not support Hadoop Streaming at the moment. I don't know how it
works under the hood, but it may be that Hadoop Streaming works once we
have support for executing regular Hadoop jobs. However, I would not bet
on or wait for that. There are efforts to add a Python interface for
Flink, but this is still work in progress.

I'd recommend getting started with the regular Java API. For that, I point
you to the Java Quickstart on our website:
--> http://flink.incubator.apache.org/docs/0.6-incubating/java_api_quickstart.html

Let us know if you have further questions.

Best,
Fabian

2014-08-27 16:48 GMT+02:00 Anirvan BASU <[email protected]>:

> Hello Artem, Fabian and Robert,
>
> Thank you for your valuable inputs.
>
> Here is what we are trying to do currently (scratching the surface):
>
> Consider typical Hadoop commands of the form:
>
> hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
>   -D stream.num.map.output.key.fields=2 \
>   -D mapred.text.key.partitioner.options=-k1,1 \
>   -inputformat org.apache.hadoop.mapred.TextInputFormat \
>   -input stocks/input/stocks.txt \
>   -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
>
> Here we run a Hadoop job with some parameters and keys specified.
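As background on the two -D options in the command above: Hadoop Streaming treats each mapper output line as tab-separated fields; stream.num.map.output.key.fields=2 makes the first two fields the key, and mapred.text.key.partitioner.options=-k1,1 tells KeyFieldBasedPartitioner to hash only the first field. A rough sketch of that behavior in plain Python (not Hadoop code; the stock record and reducer count are made up for illustration):

```python
NUM_KEY_FIELDS = 2   # stream.num.map.output.key.fields
NUM_REDUCERS = 4     # illustrative stand-in for mapred.reduce.tasks

def split_key_value(line, num_key_fields=NUM_KEY_FIELDS):
    """Split a tab-separated mapper output line into (key fields, value fields)."""
    fields = line.rstrip("\n").split("\t")
    return fields[:num_key_fields], fields[num_key_fields:]

def partition(key_fields, num_reducers=NUM_REDUCERS):
    """KeyFieldBasedPartitioner with -k1,1: hash only the first key field."""
    return hash(key_fields[0]) % num_reducers

# A hypothetical stock record: symbol and date form the composite key,
# but only the symbol decides which reducer receives the record.
key, value = split_key_value("AAPL\t2014-08-27\t100.5")
```

The effect is that all records sharing the first field land on the same reducer, while the second key field still takes part in the sort of that reducer's input.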
> hadoop jar /opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
>   -input /ngrams/input -output /ngrams/output \
>   -mapper ./python/mapper.py -file ./python/mapper.py \
>   -combiner ./python/reducer.py \
>   -reducer ./python/reducer.py -file ./python/reducer.py \
>   -jobconf stream.num.map.output.key.fields=3 \
>   -jobconf stream.num.reduce.output.key.fields=3 \
>   -jobconf mapred.reduce.tasks=10
>
> Here we run a Hadoop job with some parameters specified and with external
> scripts for the mapper, combiner, and reducer (in Python - since you have
> developed a Python connector, I cite this example).
>
> The above are slightly more complicated than "hello world"-type wordcount
> examples, ain't it?
>
> Hence, we would like to know how to wrap and run the above jobs (if
> possible) using Flink.
> In which level of compatibility would you classify the above jobs?
>
> If possible, could you help us by:
> - showing how to "wrap" around such commands?
> - listing which dependencies, sources, and classes are needed (in the
>   pom.xml if required to build a package, or as imports if required to
>   build a jar file)?
>
> Honestly, after drowning in all the information, we are trying to
> understand the (Flink) elephant as blind mice. We understand it has legs,
> a trunk, etc., but don't understand where to start to mount and ride it.
>
> Happy to clarify again and again until we reach success.
>
> Thank you for your understanding and all your kind help,
> Anirvan
>
> ----- Original Message -----
> > From: "Fabian Hueske" <[email protected]>
> > To: [email protected]
> > Sent: Wednesday, August 27, 2014 10:04:18 AM
> > Subject: Re: [stratosphere-dev] Re: Project for GSoC
> >
> > Hi Anirvan,
> >
> > we have a JIRA issue that tracks the HadoopCompatibility feature:
> > https://issues.apache.org/jira/browse/FLINK-838
> > The basic mechanism is done, but it has not been merged because we
> > found a cleaner way to integrate the feature.
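The ./python/mapper.py and ./python/reducer.py scripts referenced in the command above are not included in the thread. As a hypothetical sketch of what such a Hadoop Streaming pair typically looks like (a simple token count standing in for the real n-gram logic; in the actual scripts the functions would read sys.stdin and print to stdout):

```python
from itertools import groupby

def mapper(lines):
    """Emit 'token<TAB>1' per whitespace token (stands in for real n-gram logic)."""
    for line in lines:
        for token in line.split():
            yield "%s\t1" % token

def reducer(lines):
    """Sum counts per key; Hadoop Streaming delivers reducer input sorted by key."""
    parsed = (line.rstrip("\n").split("\t") for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        yield "%s\t%d" % (key, sum(int(kv[1]) for kv in group))

# In-memory stand-in for: cat input | mapper.py | sort | reducer.py
mapped = sorted(mapper(["to be or not to be"]))
counts = list(reducer(mapped))
# counts == ["be\t2", "not\t1", "or\t1", "to\t2"]
```

The point of the streaming contract is only stdin/stdout plus the sorted-by-key guarantee between the two stages, which is why the same reducer script can also serve as the -combiner.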
> >
> > There are different levels of Hadoop compatibility, and I did not fully
> > understand which kind of compatibility you need.
> >
> > Conceptual Compatibility:
> > Flink already offers UDF interfaces which are very similar to Hadoop's
> > Map, Combine, and Reduce functions (FlatMap, FlatCombine, GroupReduce).
> > These interfaces are not source-code compatible, but porting the code
> > should be trivial. This is already there. If you're fine with porting
> > your code from Hadoop to Flink (which should be a very small effort),
> > you're good to go.
> >
> > Function-Level Source Compatibility:
> > Providing UDF wrappers to use Hadoop Map and Reduce functions in Flink
> > programs is not difficult (since they are very similar to their Flink
> > counterparts). This is not yet included in the codebase but could be
> > added at rather short notice. We do already have interface
> > compatibility for Hadoop Input- and OutputFormats, so you can use these
> > without changing the code.
> >
> > Job-Level Source Compatibility:
> > This is what we aimed for with the GSoC project. Here we want to
> > support the execution of a full Hadoop MR job in Flink without changing
> > the code. However, this turned out to be a bit tricky if custom
> > partitioners, sort comparators, and grouping comparators are used in
> > the job. Adding this feature will take a bit more time.
> >
> > Function- and job-level source compatibility will enable the reuse of
> > already existing Hadoop code. If you are implementing new analysis jobs
> > anyway, I'd go for a Flink implementation, which eases many things such
> > as secondary sort, unions of multiple inputs, etc.
> >
> > Cheers,
> > Fabian
> >
> > 2014-08-26 11:29 GMT+02:00 Robert Metzger <[email protected]>:
> >
> > > Hi Anirvan,
> > >
> > > I'm forwarding this message to [email protected]. You
> > > need to send an (empty) message to
> > > [email protected] to subscribe to the dev list.
> > > The dev@ list is for discussions with the developers, planning, etc.
> > > The [email protected] list is for user questions (for
> > > example trouble using the API, conceptual questions, etc.).
> > > I think the message below is better suited for the dev@ list, since
> > > it's basically a feature request.
> > >
> > > Regarding the names: we don't use Stratosphere anymore. Our codebase
> > > has been renamed to Flink and the "org.apache.flink" namespace. So
> > > ideally this confusion is finally out of the world.
> > >
> > > For those who want to have a look into the history of the message,
> > > see the Google Groups archive here:
> > > https://groups.google.com/forum/#!topic/stratosphere-dev/qYvJRSoMYWQ
> > >
> > > ---------- Forwarded message ----------
> > > From: Nirvanesque <[email protected]>
> > > Date: Tue, Aug 26, 2014 at 11:12 AM
> > > Subject: [stratosphere-dev] Re: Project for GSoC
> > > To: [email protected]
> > > Cc: [email protected]
> > >
> > > Hello Artem and mentors,
> > >
> > > First of all, nice greetings from INRIA, France.
> > > Hope you had an enjoyable experience in GSoC!
> > > Thanks to Robert (rmetzger) for forwarding me here.
> > >
> > > At INRIA, we are starting to adopt Stratosphere / Flink.
> > > The top-level goal is to enhance the performance of User Defined
> > > Functions (UDFs) in long workflows with multiple M-R steps, by using
> > > the larger set of Second Order Functions (SOFs) in Stratosphere /
> > > Flink. We will demonstrate this improvement by implementing some use
> > > cases for business purposes.
> > > For this purpose, we have chosen some customer-analysis use cases
> > > based on weblogs and related data, for two companies (who appeared
> > > interested to try using Stratosphere / Flink):
> > > - a mobile phone app developer: http://www.tribeflame.com
> > > - an anti-virus & Internet security software company: www.f-secure.com
> > > I will be happy to share these use cases with you if you are
> > > interested. Just ask me here.
> > >
> > > At present, we are typically in the profiles of Alice-Bob-Sam, as
> > > described in your GSoC proposal
> > > <https://github.com/stratosphere/stratosphere/wiki/GSoC-2014-Project-Proposal-Draft-by-Artem-Tsikiridis>.
> > > :-)
> > > Hadoop seems to be the starting square for the Stratosphere / Flink
> > > journey. The same is the situation with the developers in the above
> > > two companies. :-)
> > >
> > > Briefly:
> > > We have installed and run some example programs from Flink /
> > > Stratosphere (versions 0.5.2 and 0.6). We use a cluster (Grid'5000)
> > > for our Hadoop & Stratosphere installations.
> > > We have a good understanding of Hadoop and its use with Streaming and
> > > Pipes in conjunction with scripting languages (Python & R
> > > specifically).
> > > In the first phase, we would like to run some "Hadoop-like" jobs
> > > (mainly multiple M-R workflows) on Stratosphere, preferably without
> > > extensive Java or Scala programming.
> > > I refer to your GSoC project map
> > > <https://github.com/stratosphere/stratosphere/wiki/%5BGSoC-14%5D-A-Hadoop-abstraction-layer-for-Stratosphere-%28Project-Map-and-Notes%29>,
> > > which seems very interesting.
> > > If we could have a Hadoop abstraction as you have mentioned, that
> > > would be ideal for our first phase.
> > > In later phases, when we implement complex join and group operations,
> > > we would dive deeper into the Stratosphere / Flink Java or Scala
> > > APIs.
> > >
> > > Hence, I would like to know: what is the current status in this
> > > direction? What has been implemented already? From which version
> > > onwards? How can we try it?
> > > What is yet to be implemented? When, and in which versions?
> > >
> > > You may also like to see my discussion with Robert on this page
> > > <http://flink.incubator.apache.org/docs/0.6-incubating/cli.html#comment-1558297261>.
> > >
> > > I am still mining different discussions - here as well as on JIRA.
> > > Please do refer me to the relevant links, JIRA tickets, etc. if that
> > > saves you the time of re-typing long replies.
> > > It will also help us understand the train of collective thinking in
> > > the Stratosphere / Flink roadmap.
> > >
> > > Thanks in advance,
> > > Anirvan
> > >
> > > PS: Apologies for mixing names (e.g. Flink / Stratosphere), as I am
> > > not sure which name exactly to use currently.
> > >
> > > On Tuesday, February 25, 2014 10:23:09 PM UTC+1, Artem Tsikiridis
> > > wrote:
> > > >
> > > > Hello Fabian,
> > > >
> > > > On Tuesday, February 25, 2014 11:20:10 AM UTC+2, [email protected]
> > > > wrote:
> > > > > Hi Artem,
> > > > >
> > > > > thanks a lot for your interest in Stratosphere and for
> > > > > participating in our GSoC projects!
> > > > >
> > > > > As you know, Hadoop is the big elephant out there in the Big Data
> > > > > jungle and is widely adopted. Therefore, a Hadoop compatibility
> > > > > layer is a very important feature for any large-scale data
> > > > > processing system.
> > > > > Stratosphere builds on the foundations of MapReduce but
> > > > > generalizes its concepts and provides a more efficient runtime.
> > > >
> > > > Great!
> > > >
> > > > > When you have a look at the Stratosphere WordCount example
> > > > > program, you will see that the programming principles of
> > > > > Stratosphere and Hadoop MapReduce are quite similar, although
> > > > > Stratosphere is not compatible with the Hadoop interfaces.
> > > >
> > > > Yes, I've looked into the examples (WordCount, k-means). I also ran
> > > > the big test job you have locally, and it seems to be OK.
> > > >
> > > > > With the proposed project we want to achieve that Hadoop
> > > > > MapReduce jobs can be executed on Stratosphere without changing a
> > > > > line of code (if possible).
> > > > >
> > > > > We already have some pieces for that in place. InputFormats are
> > > > > done (see https://github.com/stratosphere/stratosphere/tree/master/stratosphere-addons/hadoop-compatibility);
> > > > > OutputFormats are work in progress. The biggest missing piece is
> > > > > executing Hadoop Map and Reduce tasks in Stratosphere. Hadoop
> > > > > provides quite a few interfaces (e.g., overriding the
> > > > > partitioning function and sorting comparators, counters, the
> > > > > distributed cache, ...). It would of course be desirable to
> > > > > support as many of these interfaces as possible, but they can be
> > > > > added step by step once the first Hadoop jobs are running on
> > > > > Stratosphere.
> > > >
> > > > So if I understand correctly, the idea is to create logical
> > > > wrappers for all interfaces used by Hadoop jobs (the way it has
> > > > been done with the Hadoop datatypes) so that a job can run as
> > > > transparently as possible on Stratosphere in an efficient way. I
> > > > agree, there are many interfaces, but it's very interesting
> > > > considering the way Stratosphere defines tasks, which is a bit
> > > > different (though, as you said, the principle is similar).
> > > >
> > > > I assume the focus is on the YARN version of Hadoop (the new API)?
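The "logical wrappers" idea discussed above, adapting a Hadoop-style UDF so it can be called through a FlatMap-style interface, can be sketched in miniature. The snippet below is plain Python for illustration, not the actual Hadoop or Stratosphere API: a Hadoop-style map takes (key, value) plus a collector, while a FlatMap-style function takes one record plus a collector, so a wrapper mostly just unpacks the record:

```python
# Conceptual sketch only: signatures are simplified stand-ins for the
# real Hadoop Mapper and Stratosphere/Flink FlatMapFunction interfaces.

def hadoop_wordcount_map(key, value, collect):
    """Hadoop-style map: (k1, v1) in, (k2, v2) pairs out via a collector."""
    for word in value.split():
        collect((word, 1))

def wrap_hadoop_map(hadoop_map):
    """Return a FlatMap-style function: one record in, collector out."""
    def flat_map(record, collect):
        key, value = record  # assume records are (key, value) pairs
        hadoop_map(key, value, collect)
    return flat_map

flat_map = wrap_hadoop_map(hadoop_wordcount_map)
out = []
flat_map((0, "to be or not to be"), out.append)
# out == [('to', 1), ('be', 1), ('or', 1), ('not', 1), ('to', 1), ('be', 1)]
```

In the real wrappers, the extra Hadoop machinery (JobConf, Reporter, OutputCollector, counters) is what makes this more involved than the sketch suggests.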
> > > >
> > > > And one last question: serialization in Stratosphere is Java's
> > > > default mechanism, right?
> > > >
> > > > > Regarding your question about cloud deployment scripts, one of
> > > > > our team members is currently working on this (see this thread:
> > > > > https://groups.google.com/forum/#!topic/stratosphere-dev/QZPYu9fpjMo).
> > > > > I am not sure if this is still in the making or already done. If
> > > > > you are interested in this as well, just drop a line to the
> > > > > thread. Although I am not very familiar with the details of this,
> > > > > my gut feeling is that this would be a bit too little for an
> > > > > individual project. However, there might be ways to extend this.
> > > > > So if you have any ideas, share them with us and we will be happy
> > > > > to discuss them.
> > > >
> > > > Thank you for bringing up the topic. I will let you know if I come
> > > > up with anything for this. Probably after I try deploying it on
> > > > OpenStack.
> > > >
> > > > > Again, thanks a lot for your interest, and please don't hesitate
> > > > > to ask questions. :-)
> > > >
> > > > Thank you for the helpful answers.
> > > >
> > > > Kind regards,
> > > > Artem
> > > >
> > > > > Best,
> > > > > Fabian
> > > > >
> > > > > On Tuesday, February 25, 2014 9:12:10 AM UTC+1,
> > > > > [email protected] wrote:
> > > > > Dear Stratosphere devs and fellow GSoC potential students,
> > > > > Hello!
> > > > > I'm Artem, an undergraduate student from Athens, Greece. You can
> > > > > find me on GitHub (https://github.com/atsikiridis) and
> > > > > occasionally on Stack Overflow
> > > > > (http://stackoverflow.com/users/2568511/artem-tsikiridis).
> > > > > Currently, however, I'm in Switzerland, where I am doing an
> > > > > internship at CERN as a back-end software developer for INSPIRE,
> > > > > a library for High Energy Physics (we're running on
> > > > > http://inspirehep.net/).
> > > > > The service is in Python (based on the open-source project
> > > > > http://invenio.net), and my responsibilities are mostly the
> > > > > integration with Redis, database abstractions, testing (unit,
> > > > > regression), and helping our team to integrate modern
> > > > > technologies and frameworks into the current code base.
> > > > > Moreover, I am very interested in big data technologies;
> > > > > therefore, before coming to CERN I had been trying to make my
> > > > > first steps in research at the Big Data lab of AUEB, my home
> > > > > university. The main objective of the project I was involved with
> > > > > was the implementation of a dynamic caching mechanism for Hadoop
> > > > > (in a way, trying our cache instead of the built-in distributed
> > > > > cache). Other technologies involved were Redis, Memcached, and
> > > > > Ehcache (Terracotta). With this project we gained some insights
> > > > > into the internals of Hadoop (new API, old API, how tasks work,
> > > > > Hadoop serialization, the running daemons, etc.) and HDFS, and we
> > > > > deployed clusters on cloud computing platforms (OpenStack with
> > > > > Nova, Amazon EC2 with boto). We also used the Java Remote API for
> > > > > some tests.
> > > > > Unfortunately, I have not used Stratosphere before in a
> > > > > research/production environment. I have only played with the
> > > > > examples on my local machine. It is very interesting and I would
> > > > > love to learn more.
> > > > > There will probably be a learning curve for me on the
> > > > > Stratosphere side, but implementing a Hadoop compatibility layer
> > > > > seems like a very interesting project and I believe I can be of
> > > > > use :)
> > > > > Finally, I was wondering whether there are some command-line
> > > > > tools for deploying Stratosphere automatically on EC2 or
> > > > > OpenStack clouds (for example, Stratosphere-specific abstractions
> > > > > on top of the Python boto API). Do you think that would make
> > > > > sense as a project?
> > > > > Pardon me for the length of this.
> > > > > Kind regards,
> > > > > Artem
> > >
> > > --
> > > You received this message because you are subscribed to the Google
> > > Groups "stratosphere-dev" group.
> > > To unsubscribe from this group and stop receiving emails from it,
> > > send an email to [email protected].
> > > Visit this group at http://groups.google.com/group/stratosphere-dev.
> > > For more options, visit https://groups.google.com/d/optout.
