Hi Edward, that sounds very interesting! Let us know if you have any problems setting up or configuring Flink. We'll be very happy to help.
Cheers,
Fabian

2014-08-29 4:30 GMT+02:00 Edward J. Yoon <[email protected]>:
> Cool!
>
> >>> Very nice indeed! How well is this tested? Can it already run all the
> >>> example queries you have? Can you say anything about the performance
> >>> of the different underlying execution engines?
>
> Recently I have been planning a benchmark of the performance of the new
> Hama release. I might be able to generate a comparison table between
> Spark, Hama, and Flink.
>
> On Fri, Aug 29, 2014 at 12:13 AM, Leonidas Fegaras <[email protected]> wrote:
> > I neglected to mention that this is still work in progress (!). It has
> > all the necessary parts to work with Flink but still has bugs and
> > obviously needs lots of performance tuning. The reason I announced it
> > early is to get feedback and hopefully bug reports from the dev@flink
> > list. But I must say you already gave me a lot of encouragement. Thanks!
> > The major component missing in this system is working with HDFS in
> > distributed mode by default. Right now, it uses the local file system
> > (which is NFS-shared by the workers) in both local and distributed mode,
> > which is terribly inefficient. For local mode, I want the local working
> > directory to be the default for relative paths (I think this works OK).
> > For distributed mode, I want HDFS and the user home on HDFS to be the
> > default. I will try to fix this and have a workable system for Yarn by
> > the end of this weekend. The local mode works fine now, I think.
> > It was easy to port the MRQL physical operators to Flink DataSet
> > methods; I have done something similar for Spark. The components that
> > took me long to develop were the DataSources and the DataSinks. All the
> > other MRQL backends use Hadoop HDFS, so I had to copy some of the files
> > from my core system that uses HDFS to the Flink backend, change their
> > names, and use the Flink filesystem packages (which are very similar to
> > Hadoop HDFS).
> > Another problem was that I had heavily used Hadoop SequenceFiles to
> > store results for the other backends, so I had to switch to Flink's
> > BinaryOutputFormat. The DataSinks in Flink are not very convenient. I
> > wish there were a DataSink that contains an Iterator, so that we could
> > use the results for purposes other than storing them in files. Also,
> > compared to Spark, there are very few ways to send results from the
> > workers to the master node after execution. Custom aggregators still
> > have a bug when the aggregation result is a custom class (it's a
> > serialization problem: the class of the deserialized result doesn't
> > match the expected class, although they have the same name). In general,
> > I encountered some problems with serialization: sometimes I couldn't use
> > inner classes for the Flink functional parameters and had to define them
> > as static classes. Another thing that took me a couple of days to fix
> > was dumping data from an Iterator to a Flink binary file. Dumping the
> > iterator data into a vector first was not feasible because these data
> > may be huge. First, I tried to use the fromCollection method, but it
> > required that the Iterator be serializable (which doesn't make sense;
> > how do you make an Iterator serializable?). Then I used the following
> > hack:
> >
> >   BinaryOutputFormat of = new BinaryOutputFormat();
> >   of.setOutputFilePath(path);
> >   of.open(0,2);
> >   ...
> >
> > It took me a while to find that I needed to use of.open(0,2) instead of
> > of.open(0,1). Why do we need 2 tasks?
> > So, thanks for your encouragement. I will try to fix some of these bugs
> > by Monday and have a system that performs well on Yarn.
> > Leonidas
> >
> >
> > On 08/28/2014 03:58 AM, Fabian Hueske wrote:
> >>
> >> That's really cool!
> >>
> >> I'm also curious about your experience with Flink. Did you find major
> >> obstacles that you needed to overcome for the integration?
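[Editor's note: the inner-class serialization problem described above is standard Java behavior, not Flink-specific: a non-static inner class holds a hidden reference to its enclosing instance, so serializing the function object drags in the (usually non-serializable) outer class. A minimal sketch with plain java.io serialization and hypothetical class names, no Flink dependency, shows why declaring the function as a static nested class avoids the problem:]

```java
import java.io.*;

public class InnerClassSerialization {
    // Outer stands in for a driver/query-translation class; it is
    // deliberately NOT Serializable.
    static class Outer {
        // Non-static inner class: carries a hidden this$0 reference to Outer.
        class InnerFn implements Serializable {
            public int apply(int x) { return x + 1; }
        }
        // Static nested class: no hidden outer reference, serializes cleanly.
        static class StaticFn implements Serializable {
            public int apply(int x) { return x + 1; }
        }
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // The static nested function serializes fine.
        serialize(new Outer.StaticFn());
        System.out.println("static nested class: ok");
        // The inner class fails: serializing it requires serializing Outer.
        try {
            serialize(new Outer().new InnerFn());
            System.out.println("inner class: ok");
        } catch (NotSerializableException e) {
            System.out.println("inner class: NotSerializableException: "
                               + e.getMessage());
        }
    }
}
```

[This is the same effect Flink's functional parameters run into, since they must be shipped to the workers; defining them as static nested (or top-level) classes is the usual workaround.]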
> >> Is there some write-up / report available somewhere (maybe in JIRA)
> >> that discusses the integration? Are you using Flink's full operator
> >> set or do you compile everything into Map and Reduce?
> >>
> >> Best, Fabian
> >>
> >>
> >> 2014-08-28 7:37 GMT+02:00 Aljoscha Krettek <[email protected]>:
> >>
> >>> Very nice indeed! How well is this tested? Can it already run all the
> >>> example queries you have? Can you say anything about the performance
> >>> of the different underlying execution engines?
> >>>
> >>> On Thu, Aug 28, 2014 at 12:58 AM, Stephan Ewen <[email protected]> wrote:
> >>>>
> >>>> Wow, that is impressive!
> >>>>
> >>>>
> >>>> On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <[email protected]> wrote:
> >>>>
> >>>>> Awesome, indeed! Looking forward to trying it out. :)
> >>>>>
> >>>>>
> >>>>> On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> Awesome!
> >>>>>>
> >>>>>>
> >>>>>> 2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <[email protected]>:
> >>>>>>
> >>>>>>> Hello,
> >>>>>>> I would like to let you know that Apache MRQL can now run queries
> >>>>>>> on Flink.
> >>>>>>>
> >>>>>>> MRQL is a query processing and optimization system for large-scale,
> >>>>>>> distributed data analysis, built on top of Apache Hadoop/map-reduce,
> >>>>>>> Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
> >>>>>>> They can work on complex, user-defined data (such as JSON and XML)
> >>>>>>> and can express complex queries (such as pagerank and matrix
> >>>>>>> factorization).
> >>>>>>>
> >>>>>>> MRQL on Flink has been tested on local mode and on a small Yarn
> >>>>>>> cluster.
> >>>>>>> Here are the directions on how to build the latest MRQL snapshot:
> >>>>>>>
> >>>>>>>   git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git mrql
> >>>>>>>   cd mrql
> >>>>>>>   mvn -Pyarn clean install
> >>>>>>>
> >>>>>>> To make it run on your cluster, edit conf/mrql-env.sh and set the
> >>>>>>> Java, the Hadoop, and the Flink installation directories.
> >>>>>>>
> >>>>>>> Here is how to run PageRank. First, you need to generate a random
> >>>>>>> graph and store it in a file using the MRQL query RMAT.mrql:
> >>>>>>>
> >>>>>>>   bin/mrql.flink -local queries/RMAT.mrql 1000 10000
> >>>>>>>
> >>>>>>> This will create a graph with 1K nodes and 10K edges using the RMAT
> >>>>>>> algorithm, will remove duplicate edges, and will store the graph in
> >>>>>>> the binary file graph.bin. Then, run PageRank on Flink mode using:
> >>>>>>>
> >>>>>>>   bin/mrql.flink -local queries/pagerank.mrql
> >>>>>>>
> >>>>>>> To run MRQL/Flink on a Yarn cluster, first start the Flink container
> >>>>>>> on Yarn by running the script yarn-session.sh, such as:
> >>>>>>>
> >>>>>>>   ${FLINK_HOME}/bin/yarn-session.sh -n 8
> >>>>>>>
> >>>>>>> This will print the name of the Flink JobManager, which can be used in:
> >>>>>>>
> >>>>>>>   export FLINK_MASTER=name-of-the-Flink-JobManager
> >>>>>>>   bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
> >>>>>>>
> >>>>>>> This will create a graph with 1M nodes and 10M edges using RMAT on
> >>>>>>> 16 nodes (slaves). You can adjust these numbers to fit your cluster.
> >>>>>>> Then, run PageRank using:
> >>>>>>>
> >>>>>>>   bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
> >>>>>>>
> >>>>>>> The MRQL project page is at: http://mrql.incubator.apache.org/
> >>>>>>>
> >>>>>>> Let me know if you have any questions.
> >>>>>>> Leonidas Fegaras
>
>
> --
> Best Regards, Edward J. Yoon
> CEO at DataSayer Co., Ltd.
