[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Milinda Lakmal Pathirage (JIRA) Tue, 25 Nov 2014 12:10:11 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225130#comment-14225130
 ]


Milinda Lakmal Pathirage commented on SAMZA-390:
------------------------------------------------

Below are my thoughts about several things discussed above.

1. Lack of tuple based sliding windows in CQL: Tuple based sliding windows are 
there in CQL. In addition to tuple and time based windows, it introduces 
partitioned windows, where the stream is partitioned during window creation.
2. If we are going for SQL-like semantics, it doesn't matter whether we decide 
to go with an embedded DSL or plain SQL. We can decouple the operator layer and 
the language layer. That's how I have designed Freshet, mentioned in an earlier 
comment. In my implementation I have several Samza tasks which implement 
different operators such as window, select, project and aggregate. What the DSL 
layer does is generate a relational-algebra-like expression, which is then 
converted to a Samza job graph. Each individual node in this job graph is a 
Samza job with its properties such as input stream, output stream and other 
operator-specific configuration parameters. In the final phase, each node is 
converted into a properties file which describes its Samza job. IMHO, we should 
first decide what type of semantics we are going to support and then design the 
operator layer based on those semantics.
3. I am not sure whether we should add the concept of time to Samza. We can 
implement the concept of time in our operator layer and query layer, without 
integrating it into Samza. Currently in Freshet [1], I don't have the time 
concept implemented. But I have designed the DSL in a way that the query 
writer can specify how to timestamp individual tuples. For example, the 
timestamp can be a field in the tuple in some scenarios, so in the Freshet DSL 
the user can specify which field contains the timestamp. Otherwise, Freshet 
uses the time when the tuple first appeared in Freshet as the tuple's timestamp.
4. I am not sure I completely understood [~raulcf]'s comment about moving the 
concept of windows out of the query. But if we are doing that, what we can do 
alternatively is keep the concept of windows in the query but implement the 
operator layer/physical query plan in a way that separates the windowing logic 
from the query logic. In my experience, CQL already does this to some extent. 
In most, if not all, scenarios, the first thing that happens in a CQL query is 
window generation, expressed as an insert/delete stream. An insert/delete 
stream assumes each tuple can be uniquely identified. When a tuple is added to 
the window, it gets emitted to the output stream as an 'insert' tuple, and when 
a tuple gets removed from the window, it gets emitted to the output stream as a 
'delete' tuple. The downstream operator should handle these insert/delete 
tuples according to its own logic. I found that this concept of an 
insert/delete stream really simplifies window handling. The "CQL 
Implementation in STREAM" section in [2] contains some interesting details 
about stream query execution and how sliding windows can be implemented as I 
described above.
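
To make the insert/delete stream idea in point 4 a bit more concrete, here is a 
minimal sketch in Python. It is not how Freshet or STREAM implement it; the 
class names (TupleSlidingWindow, RunningSum), the (id, value) tuple layout and 
the choice to emit the 'delete' before the 'insert' are all assumptions made 
only for illustration.

```python
from collections import deque


class TupleSlidingWindow:
    """Tuple-based sliding window that reports changes as an insert/delete stream.

    Hypothetical illustration only: keeps the last `size` tuples.
    """

    def __init__(self, size):
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.window = deque()

    def add(self, tup):
        """Add one tuple and return the (kind, tuple) events this causes."""
        events = []
        if len(self.window) == self.size:
            # Window is full: the oldest tuple expires and is emitted as 'delete'.
            events.append(("delete", self.window.popleft()))
        self.window.append(tup)
        events.append(("insert", tup))
        return events


class RunningSum:
    """Hypothetical downstream operator: maintains SUM(value) over the window."""

    def __init__(self):
        self.total = 0

    def process(self, kind, tup):
        _tuple_id, value = tup  # assumed layout: (unique id, numeric value)
        self.total += value if kind == "insert" else -value
        return self.total


window = TupleSlidingWindow(size=2)
aggregate = RunningSum()
for tup in [(1, 10), (2, 20), (3, 30)]:
    for kind, changed in window.add(tup):
        aggregate.process(kind, changed)
```

The point of the sketch is that the aggregate never looks at the window 
itself. It only reacts to 'insert' and 'delete' tuples, so the windowing logic 
and the query logic stay in separate operators.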

[1] https://github.com/milinda/Freshet
[2] https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
