[ https://issues.apache.org/jira/browse/MRQL-63?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Leonidas Fegaras resolved MRQL-63. ---------------------------------- Resolution: Implemented The patch was applied to GIT master. > Add support for MRQL streaming in spark streaming mode > ------------------------------------------------------ > > Key: MRQL-63 > URL: https://issues.apache.org/jira/browse/MRQL-63 > Project: MRQL > Issue Type: Improvement > Components: Streaming > Affects Versions: 0.9.4 > Reporter: Leonidas Fegaras > Assignee: Leonidas Fegaras > Attachments: MRQL-63.patch > > > This patch introduces a major extension to MRQL, called MRQL streaming. > We can now run continuous MRQL queries on streams of data. > Currently, it works on Spark Streaming only but we may add support for Flink > Streaming and/or Storm in the future. > It has been tested in Spark local mode and in Spark distributed mode on a > Yarn cluster. > MRQL now supports window-based streaming based on a sliding window during a > certain time interval. To support MRQL streaming, you need to add the > parameter "-stream t" to the mrql command, where t is the time interval in > milliseconds. Then MRQL will processes the new batch of data in the input > streams every t milliseconds. > A stream source in MRQL takes the form stream(...), which has the same > parameters as the source(...) form. For example: > {code:SQL} > select (k,avg(p.Y)) > from p in stream(binary,"tmp/points.bin") > group by k: p.X; > {code} > This query process all sequence files in the directory tmp/points.bin and > then checks this directory every t milliseconds for new files. When a new > file is inserted in the directory (or if the modification time of an existing > file changes), it processes the new files. One may work on multiple files and > the query may contain both stream and regular data sources. If there is at > least one stream source, the query becomes continuous (never stops). One may > dump the output stream to binary or CVS files using the existing MRQL syntax: > {code:SQL} > store "tmp/out" from e > {code} > This dumps the output of the continuous query e into tmp/out/f1, tmp/out/f2, > ... etc. > Example for testing: > First create data: > {quote} > mrql.spark -local queries/points.mrql 100 > {quote} > Then run the continuous query: > {quote} > mrql.spark -local -stream 1000 queries/streaming.mrql > {quote} > On a separate terminal, you can type: > {quote} > touch tmp/points.bin/part-00000 > {quote} > to process a new batch of data. -- This message was sent by Atlassian JIRA (v6.3.4#6332)