MRQL on Flink

Leonidas Fegaras Wed, 27 Aug 2014 13:50:32 -0700

Hello,
I would like to let you know that Apache MRQL can now run queries on Flink.
MRQL is a query processing and optimization system for large-scale,
distributed data analysis, built on top of Apache Hadoop/map-reduce,
Hama, Spark, and now Flink. MRQL queries are SQL-like but not SQL.
They can work on complex, user-defined data (such as JSON and XML) and
can express complex queries (such as pagerank and matrix factorization).


MRQL on Flink has been tested in local mode and on a small Yarn cluster.

Here is how to build the latest MRQL snapshot:

git clone https://git-wip-us.apache.org/repos/asf/incubator-mrql.git mrql
cd mrql
mvn -Pyarn clean install

To run it on your cluster, edit conf/mrql-env.sh and set the Java,
Hadoop, and Flink installation directories.
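
As a rough illustration, the relevant settings in conf/mrql-env.sh might
look like the fragment below. Only FLINK_HOME appears elsewhere in this
message; JAVA_HOME and HADOOP_HOME are assumed names, and all three paths
are placeholders, so check the comments in the file itself before editing.

```shell
# Sketch of the installation directories in conf/mrql-env.sh.
# JAVA_HOME and HADOOP_HOME are assumed variable names; paths are placeholders.
JAVA_HOME=/path/to/jdk
HADOOP_HOME=/path/to/hadoop
FLINK_HOME=/path/to/flink
```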

Here is how to run PageRank. First, you need to generate a random
graph and store it in a file using the MRQL query RMAT.mrql:

bin/mrql.flink -local queries/RMAT.mrql 1000 10000

This will create a graph with 1K nodes and 10K edges using the RMAT
algorithm, remove duplicate edges, and store the graph in the binary
file graph.bin. Then, run PageRank in Flink mode using:

bin/mrql.flink -local queries/pagerank.mrql
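
If you run the two local-mode commands above often, a small wrapper can
chain them. The following is only a sketch: the names MRQL_CMD, DRY_RUN,
run, and local_pagerank are made up for this example and are not part of
MRQL. It assumes you are in the top-level mrql directory.

```shell
#!/bin/sh
# Sketch: generate an RMAT graph, then run PageRank, both in local mode.
# MRQL_CMD and DRY_RUN are hypothetical names introduced for this example.
MRQL_CMD="${MRQL_CMD:-bin/mrql.flink}"

run() {
    # Print the command; execute it only when DRY_RUN is not set to 1.
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

local_pagerank() {
    nodes="$1"
    edges="$2"
    # Step 1: generate the random graph (written to graph.bin by RMAT.mrql).
    run "$MRQL_CMD" -local queries/RMAT.mrql "$nodes" "$edges" || return 1
    # Step 2: run PageRank on the generated graph.
    run "$MRQL_CMD" -local queries/pagerank.mrql
}
```

After sourcing the file, calling local_pagerank 1000 10000 with DRY_RUN=1
prints the two commands without executing them. Without DRY_RUN, PageRank
is skipped if graph generation fails.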

To run MRQL/Flink on a Yarn cluster, first start the Flink container
on Yarn by running the script yarn-session.sh, for example:

${FLINK_HOME}/bin/yarn-session.sh -n 8

This will print the name of the Flink JobManager. Set FLINK_MASTER to
that name before running MRQL in distributed mode:

export FLINK_MASTER=name-of-the-Flink-JobManager
bin/mrql.flink -dist -nodes 16 queries/RMAT.mrql 1000000 10000000

This will create a graph with 1M nodes and 10M edges using RMAT on 16
nodes (slaves). You can adjust these numbers to fit your cluster.
Then, run PageRank using:

bin/mrql.flink -dist -nodes 16 queries/pagerank.mrql
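
Since the distributed commands depend on FLINK_MASTER being set, it can
help to check for it before launching anything. The helper below is a
hypothetical sketch (dist_cmd is not part of MRQL): it only builds and
prints the command line, and refuses to do so when FLINK_MASTER is empty.

```shell
#!/bin/sh
# Sketch: build a distributed-mode command line only if FLINK_MASTER is set.
# dist_cmd is a hypothetical helper; it prints the command, it does not run it.
dist_cmd() {
    nodes="$1"
    shift
    if [ -z "${FLINK_MASTER:-}" ]; then
        echo "FLINK_MASTER is not set; start yarn-session.sh first" >&2
        return 1
    fi
    # Remaining arguments are the query file and its parameters.
    echo "bin/mrql.flink -dist -nodes $nodes $*"
}
```

For example, once FLINK_MASTER is exported, dist_cmd 16 queries/pagerank.mrql
prints the PageRank command shown above, which you can then run yourself.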

The MRQL project page is at: http://mrql.incubator.apache.org/

Let me know if you have any questions.
Leonidas Fegaras
