[ 
https://issues.apache.org/jira/browse/MRQL-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leonidas Fegaras resolved MRQL-63.
----------------------------------
    Resolution: Implemented

The patch was applied to GIT master.

> Add support for MRQL streaming in spark streaming mode
> ------------------------------------------------------
>
>                 Key: MRQL-63
>                 URL: https://issues.apache.org/jira/browse/MRQL-63
>             Project: MRQL
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 0.9.4
>            Reporter: Leonidas Fegaras
>            Assignee: Leonidas Fegaras
>         Attachments: MRQL-63.patch
>
>
> This patch introduces a major extension to MRQL, called MRQL streaming.
> We can now run continuous MRQL queries on streams of data.
> Currently, it works on Spark Streaming only but we may add support for Flink 
> Streaming and/or Storm in the future.
> It has been tested in Spark local mode and in Spark distributed mode on a 
> Yarn cluster.
> MRQL now supports window-based streaming based on a sliding window during a 
> certain time interval. To support MRQL streaming, you need to add the 
> parameter "-stream t" to the mrql command, where t is the time interval in 
> milliseconds. Then MRQL will processes the new batch of data in the input 
> streams every t milliseconds.
> A stream source in MRQL takes the form stream(...), which has the same 
> parameters as the source(...) form. For example:
> {code:SQL}
> select (k,avg(p.Y))
> from p in stream(binary,"tmp/points.bin")
> group by k: p.X;
> {code}
> This query process all sequence files in the directory tmp/points.bin and 
> then checks this directory every t milliseconds for new files. When a new 
> file is inserted in the directory (or if the modification time of an existing 
> file changes), it processes the new files. One may work on multiple files and 
> the query may contain both stream and regular data sources. If there is at 
> least one stream source, the query becomes continuous (never stops). One may 
> dump the output stream to binary or CVS files using the existing MRQL syntax:
> {code:SQL}
> store "tmp/out" from e
> {code}
> This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2, 
> ... etc.
> Example for testing:
> First create data:
> {quote}
> mrql.spark -local queries/points.mrql 100
> {quote}
> Then run the continuous query:
> {quote}
> mrql.spark -local -stream 1000 queries/streaming.mrql
> {quote}
> On a separate terminal, you can type:
> {quote}
> touch tmp/points.bin/part-00000
> {quote}
> to process a new batch of data.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to