[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14225130#comment-14225130
]
Milinda Lakmal Pathirage commented on SAMZA-390:
------------------------------------------------
Below are my thoughts about several things discussed above.
1. Lack of tuple based sliding windows in CQL: Tuple-based sliding windows are
there in CQL. In addition to tuple-based and time-based windows, it introduces
partitioned windows, where the stream is partitioned during window creation.
2. If we are going for SQL-like semantics, it doesn't matter whether we decide
to go with an embedded DSL or plain SQL. We can decouple the operator layer and
the language layer. That's how I have designed Freshet, mentioned in an earlier
comment. In my implementation I have several Samza tasks which implement
different operators such as window, select, project and aggregate. What the DSL
layer does is generate a relational-algebra-like expression, which is then
converted to a Samza job graph. Each individual node in this job graph is a
Samza job with its properties, such as input stream, output stream and other
operator-specific configuration parameters. In the final phase, each node is
converted into a properties file which describes the Samza job. IMHO, we should
first decide what type of semantics we are going to support and then design the
operator layer based on those semantics.
3. I am not sure whether we should add the concept of time to Samza. We can
implement the concept of time in our operator layer and query layer, without
integrating it into Samza. Currently in Freshet [1], I don't have the time
concept implemented. But I have designed the DSL in such a way that the query
writer can specify how to timestamp individual tuples. For example, in some
scenarios the timestamp can be a field in the tuple, so in the Freshet DSL the
user can specify which field contains the timestamp. Otherwise, Freshet uses
the time when the tuple first appeared in Freshet as the tuple's timestamp.
4. I am not sure I completely understood [~raulcf]'s comment about moving the
concept of windows out of the query. But if we are doing that, what we can do
alternatively is keep the concept of windows in the query but implement the
operator layer/physical query plan in a way that separates the windowing logic
from the query logic. In my experience with CQL, the same thing happens in CQL
to some extent. In most, if not all, scenarios, the first thing that happens in
a CQL query is window generation as an insert/delete stream. An insert/delete
stream assumes each tuple can be uniquely identified. When a tuple is added to
the window, it gets emitted to the output stream as an 'insert' tuple, and when
a tuple is removed from the window, it gets emitted to the output stream as a
'delete' tuple. Downstream operators should handle these insert/delete tuples
according to their logic. I found that this concept of an insert/delete stream
really simplifies window handling. The "CQL Implementation in STREAM" section
in [2] contains some interesting details about stream query execution and how
sliding windows can be implemented as I described above.
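To make the insert/delete idea in point 4 concrete, here is a minimal Python
sketch of a tuple-based sliding window that emits an insert/delete stream,
followed by a downstream operator that consumes it. This is only an
illustration under my own assumptions, not code from Freshet or STREAM: the
function names, the dict-based tuple representation and the "id"/"v" field
names are all hypothetical.

```python
from collections import deque
from typing import Any, Deque, Dict, Iterable, Iterator, Tuple

# Hypothetical tuple representation: a dict carrying a unique "id" field.
Event = Dict[str, Any]
Change = Tuple[str, Event]  # ("insert" | "delete", tuple)


def tuple_window_changes(stream: Iterable[Event], size: int) -> Iterator[Change]:
    """Emit the insert/delete stream for a window over the last `size` tuples."""
    if size <= 0:
        raise ValueError("window size must be positive")
    window: Deque[Event] = deque()
    for event in stream:
        if len(window) == size:
            # The oldest tuple leaves the window, so emit it as a 'delete'.
            yield ("delete", window.popleft())
        window.append(event)
        # The new tuple enters the window, so emit it as an 'insert'.
        yield ("insert", event)


def running_sum(changes: Iterable[Change], field: str) -> Iterator[Any]:
    """Downstream operator: keep a sum over the current window contents.

    It never sees the window itself, only insert/delete tuples, and it
    emits the updated sum after each insert.
    """
    total = 0
    for kind, event in changes:
        if kind == "insert":
            total += event[field]
            yield total
        else:
            total -= event[field]
```

With a window of size 2 over tuples whose "v" values are 1, 2, 3, 4, the
change stream is insert, insert, delete, insert, delete, insert, and the
running sum emits 1, 3, 5, 7. The aggregate holds no window state of its own,
which is the separation of windowing and query logic I was referring to.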
[1] https://github.com/milinda/Freshet
[2] https://cs.uwaterloo.ca/~david/cs848/stream-cql.pdf
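Going back to point 2, the step from job graph to properties files can be
sketched roughly as below. This is a simplified illustration and not Freshet's
actual code: it assumes a linear chain of operators rather than a general
relational algebra tree, and the class names, intermediate stream names and
"operator.*" keys are hypothetical. Only job.name, task.class and task.inputs
are standard Samza configuration keys.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OperatorNode:
    """One node of the job graph; each node corresponds to one Samza job."""
    name: str
    task_class: str      # hypothetical operator task class
    input_stream: str    # "system.stream" form, e.g. "kafka.orders"
    output_stream: str
    params: Dict[str, str] = field(default_factory=dict)


def linear_job_graph(query: str, source: str, sink: str,
                     operators: List[Tuple[str, Dict[str, str]]]) -> List[OperatorNode]:
    """Chain operators so that each node reads the previous node's output."""
    nodes: List[OperatorNode] = []
    upstream = source
    for i, (op, params) in enumerate(operators):
        is_last = i == len(operators) - 1
        # Intermediate stream names are made up for this sketch.
        out = sink if is_last else f"kafka.{query}-{op}-{i}"
        nodes.append(OperatorNode(
            name=f"{query}-{op}-{i}",
            task_class=f"example.operators.{op.capitalize()}Task",
            input_stream=upstream,
            output_stream=out,
            params=dict(params),
        ))
        upstream = out
    return nodes


def to_properties(node: OperatorNode) -> str:
    """Render one job graph node as the text of a Samza properties file."""
    lines = [
        f"job.name={node.name}",
        f"task.class={node.task_class}",
        f"task.inputs={node.input_stream}",
        # Hypothetical operator-specific keys, not part of Samza itself.
        f"operator.output.stream={node.output_stream}",
    ]
    lines += [f"operator.{k}={v}" for k, v in sorted(node.params.items())]
    return "\n".join(lines) + "\n"
```

For example, linear_job_graph("q1", "kafka.orders", "kafka.q1-out",
[("window", {"size": "10"}), ("aggregate", {"function": "sum"})]) gives two
nodes, where the window job writes to the stream that the aggregate job reads,
and to_properties() on each node yields the text of one job's properties file.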
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)