Thanks,
Yes, it will be beneficial for both projects to cross-link. We may need
to hold off on an announcement until I make the system more stable.
I forgot to mention that having a query processor on top of Flink that
doesn't make much use of the Flink optimizer may be a bit unfair to
Flink (when we compare Flink to Spark). Flink shines when the data are
relational and we use its specialized relational methods. MRQL uses
custom data only, which doesn't leave many opportunities for the Flink
optimizer. Nevertheless, I may improve the MRQL Flink evaluator in the
future to recognize cases where the data is flat, so that it can call
the relational Flink methods instead of the generic ones. This will
require a lot of work.
I used multiple jars when I was working with a Flink snapshot, but later
switched to the uberjar after the first Flink release. I will make it
support both.
The MapReduce in the plan is not the Hadoop map-reduce; it's a physical
plan operator whose functionality is equivalent to Hadoop map-reduce. It
can easily be translated into a Flink flatMap with a groupBy.
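As a rough illustration only (this is not MRQL's actual translation), the
semantics of such a MapReduce operator — a flatMap-like map phase, a groupBy
on the key, then a reduce per group — can be sketched with plain shell
pipelines on a hypothetical word-count input; in the Flink DataSet API the
same shape is roughly data.flatMap(mapper).groupBy(0).reduceGroup(reducer):

```shell
#!/bin/sh
# Hypothetical word-count sketch of the MapReduce plan operator's semantics.
input='a b a
b a'
# map: emit one (word, 1) pair per word (the flatMap phase)
mapped=$(printf '%s\n' "$input" | tr ' ' '\n' | sed 's/$/ 1/')
# groupBy: sorting brings records with equal keys together
grouped=$(printf '%s\n' "$mapped" | sort)
# reduce: sum the counts within each key group
printf '%s\n' "$grouped" \
  | awk '{count[$1] += $2} END {for (k in count) print k, count[k]}' \
  | sort
```

For the two-line input above this prints "a 3" and "b 2", one per line.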
Leonidas
On 08/28/2014 07:32 AM, Robert Metzger wrote:
Amazing.
In my opinion, we should cross-link our projects on the websites. Maybe we
should add a section on our website where we list projects we depend on and
projects depending on us.
A little blog post / news item on our website (once an MRQL release with
Flink support is out) can also draw some attention to this great work!
I've tried following your instructions and found one issue with Java 8 on
the way: https://issues.apache.org/jira/browse/MRQL-46
I think the classpath setup of the mrql scripts assumes that the user has a
Flink Yarn uberjar file (one fat-jar with everything). I first tried it
with a regular "hadoop2" build of Flink.
We should probably generalize the classpath setup there a bit (to include
all "flink-"-prefixed jar files in the classpath).
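Something along these lines could work — a sketch only, where FLINK_HOME and
the lib/ directory layout are assumptions about the installation, not what
the mrql scripts actually use:

```shell
#!/bin/sh
# Hypothetical sketch: build the classpath from every flink-*.jar under the
# Flink lib directory instead of assuming a single uberjar.
FLINK_JARS=
for jar in "${FLINK_HOME}"/lib/flink-*.jar; do
  [ -e "$jar" ] || continue                         # glob matched nothing
  FLINK_JARS="${FLINK_JARS:+${FLINK_JARS}:}${jar}"  # append with ':' separator
done
CLASSPATH="${CLASSPATH:+${CLASSPATH}:}${FLINK_JARS}"
echo "$CLASSPATH"
```

This way the scripts would accept both a single uberjar and a multi-jar
"hadoop2"-style build, since the glob picks up whatever flink-*.jar files
are present.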
After I sorted out these issues, MRQL was working.
Is the local mode actually using Flink's local execution?
The output said:
Apache MRQL version 0.9.4 (interpreted local Flink mode using 2 tasks)
Query type: ( int, int, int, int ) -> ( int, int )
Query type: !bag(( int, int ))
Physical plan:
MapReduce:
input: Generator
In particular the "MapReduce" there was confusing me.
I hope to find some more time soon to look closer into the MRQL query
language.
Robert
On Thu, Aug 28, 2014 at 10:58 AM, Fabian Hueske <[email protected]> wrote:
That's really cool!
I'm also curious about your experience with Flink. Did you find major
obstacles that you needed to overcome for the integration?
Is there some write-up / report available somewhere (maybe in JIRA) that
discusses the integration? Are you using Flink's full operator set or do
you compile everything into Map and Reduce?
Best, Fabian
2014-08-28 7:37 GMT+02:00 Aljoscha Krettek <[email protected]>:
Very nice indeed! How well is this tested? Can it already run all the
example queries you have? Can you say anything about the performance
of the different underlying execution engines?
On Thu, Aug 28, 2014 at 12:58 AM, Stephan Ewen <[email protected]> wrote:
Wow, that is impressive!
On Thu, Aug 28, 2014 at 12:06 AM, Ufuk Celebi <[email protected]> wrote:
Awesome, indeed! Looking forward to trying it out. :)
On Wed, Aug 27, 2014 at 10:52 PM, Sebastian Schelter <[email protected]>
wrote:
Awesome!
2014-08-27 13:49 GMT-07:00 Leonidas Fegaras <[email protected]>:
Hello,
I would like to let you know that Apache MRQL can now run queries on Flink.
MRQL is a query processing and optimization system for large-scale,
distributed data analysis, built on top of Apache Hadoop/map-reduce,
Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL. They
can work on complex, user-defined data (such as JSON and XML) and can
express complex queries (such as PageRank and matrix factorization).
MRQL on Flink has been tested in local mode and on a small Yarn cluster.
Here are the directions for building the latest MRQL snapshot:
git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git mrql
cd mrql
mvn -Pyarn clean install
To make it run on your cluster, edit conf/mrql-env.sh and set the Java,
Hadoop, and Flink installation directories.
Here is how to run PageRank. First, you need to generate a random graph
and store it in a file using the MRQL query RMAT.mrql:
bin/mrql.flink -local queries/RMAT.mrql 1000 10000
This will create a graph with 1K nodes and 10K edges using the RMAT
algorithm, remove duplicate edges, and store the graph in the binary
file graph.bin. Then, run PageRank in Flink mode using:
bin/mrql.flink -local queries/pagerank.mrql
To run MRQL/Flink on a Yarn cluster, first start the Flink container on
Yarn by running the script yarn-session.sh, such as:
${FLINK_HOME}/bin/yarn-session.sh -n 8
This will print the name of the Flink JobManager, which can be used in:
export FLINK_MASTER=name-of-the-Flink-JobManager
bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000
This will create a graph with 1M nodes and 10M edges using RMAT on 16
nodes (slaves). You can adjust these numbers to fit your cluster. Then,
run PageRank using:
bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
The MRQL project page is at: http://mrql.incubator.apache.org/
Let me know if you have any questions.
Leonidas Fegaras