[ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107285#comment-14107285
 ] 

Raul Castro Fernandez commented on SAMZA-390:
---------------------------------------------

Wrapping up early discussions with Chris about high-level languages for Samza:

Some options for High-Level Languages:
----------------------------------------------------
We discussed two main options for a high-level language for Samza:
- SQL-like
- Pig/Imperative style

Below I quickly summarize both approaches:

#SQL-like

SQL was designed primarily for accessing static data. Because Samza reads and 
processes streams, which are unbounded, plain SQL is not enough. For processing 
streams, there are mainly two families of streaming-SQL dialects: CQL and 
StreamSQL (from StreamBase). The fundamental difference between the two lies in 
their window semantics:

###CQL
In this model, tuples (events, messages) are assumed to carry a timestamp. The 
processing engine thinks of windows as time-based windows, that is: "Count(A.b) 
in a window of 30sec", where the window is defined according to the timestamps 
included in the events.

###StreamSQL
In this model, tuples (events, messages) also include a timestamp. However, 
this model is better suited to tuple-based windows, of the form: "Count(A.b) in 
a window of 1000 tuples".
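To make the distinction concrete, here is a small sketch of the two window models over a finite prefix of a stream. This is purely illustrative (Python for brevity; a real Samza implementation would be JVM-based), and all names are invented:

```python
from dataclasses import dataclass

# Illustrative only: the two window models over a finite prefix of a stream.
@dataclass
class Event:
    ts: int   # event timestamp in milliseconds
    key: str

# CQL-style time-based window: count events whose timestamp falls within
# the last window_ms milliseconds relative to `now`.
def time_window_count(events, now, window_ms):
    return sum(1 for e in events if now - window_ms < e.ts <= now)

# StreamSQL-style tuple-based window: count over the last n tuples,
# regardless of how much wall-clock time they span.
def tuple_window_count(events, n):
    return len(events[-n:])
```

Note that the two counts coincide only when events arrive at a perfectly regular rate; under bursty arrival, a 30-second window and a 1000-tuple window select different sets of events.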

I think this distinction is important, because it fundamentally changes how one 
should think about processing streams when windows are required. There is a 
good paper that explains the differences between the two models in detail; it 
would be worth reading before deciding on one, once there are clear use cases 
for Samza:

"Towards a Streaming SQL Standard": http://cs.brown.edu/~ugur/streamsql.pdf

#Pig/Imperative style

In this case, the idea would be to offer an interface similar to Pig: an 
imperative style of writing transformations on streams. This model could 
include a fixed set of operators in different categories: operators to 
extract/read data from the underlying system (e.g. Kafka topics), stateless 
operators for filtering/mapping/etc., and stateful operators or stateful 
constructs (basic operators that take two inputs, the input stream and the 
state) that operate on state defined by users.
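As a rough sketch of what such operators could look like (hypothetical shapes, not an actual Samza interface; streams modeled as finite lists purely for illustration):

```python
# Hypothetical operator shapes for a Pig-like imperative layer; streams
# are modeled as finite lists purely for illustration.

# Stateless operators: a pure function applied per message.
def map_op(stream, f):
    return [f(x) for x in stream]

def filter_op(stream, p):
    return [x for x in stream if p(x)]

# A stateful operator: conceptually takes two inputs, the message and the
# current state, and emits an output plus the updated state.
def stateful_op(stream, init, f):
    state, out = init, []
    for msg in stream:
        state, value = f(state, msg)
        out.append(value)
    return out
```

For example, a running sum is `stateful_op(msgs, 0, lambda s, a: (s + a, s + a))`: the state is the sum so far, and each message emits the updated sum.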

Implementation:
---------------------

#SQL

A declarative representation of a query demands three things:

1- Parse the query
2- Build an execution tree (that would roughly map to the dataflow graph)
3- Optimize (rewriting the query to preserve the semantics but make it cheaper)

Apache Optiq provides all three; however, a number of things should be 
addressed before assuming it is a complete solution:

1- Parse the query
Queries for Samza would include windows; therefore, the parser would need to 
understand window semantics as well.
2- Execution tree
This would probably stay more or less the same
3- Optimization
This changes a lot. First of all, windows introduce new challenges for query 
rewriting: pushing an operator upstream can change the semantics, for example. 
There are both theoretical and more practically oriented papers on this, but it 
looks like a hard problem. All I am saying is that before even starting to 
think about optimization, it would be good to have a clear understanding of the 
requirements. On top of that, the nature of stream processing may lead to 
network bottlenecks, which would also need to be factored into the optimization 
algorithm. And if you start thinking about joins... very interesting!
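As a toy sketch of step 2 above, the execution tree could be a small algebra of operator nodes that later map onto the dataflow graph. Node names and the example query shape are invented for illustration (Python for brevity):

```python
from dataclasses import dataclass

# Hypothetical execution-tree nodes; each roughly maps to a dataflow operator.
@dataclass
class Scan:            # read from an underlying topic
    topic: str

@dataclass
class Filter:          # stateless predicate over messages
    predicate: str
    child: object

@dataclass
class WindowCount:     # time-based windowed aggregate
    window_ms: int
    child: object

# e.g. "Count(A.b) in a window of 30sec where b > 0" might build:
tree = WindowCount(30_000, Filter("b > 0", Scan("A")))
```

An optimizer would then rewrite such trees; the caveat above is that reordering nodes (say, moving the Filter above or below the WindowCount) is only legal when it provably preserves window semantics.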

#Imperative style

In this case we would need to design a language and write a parser and 
compiler for it. That alone is a lot of work. There is one alternative: 
writing a DSL. By using a DSL, we get rid of many of the parser/optimizer 
problems. One might think the problem with DSLs is efficiency, but that is not 
so important in this case, since we will have an "optimization" phase anyway.

So these are the two options I see for a DSL:

##Internal DSL
Meaning that it would basically be part of the host language, i.e. a library. 
This is good for fast prototyping, trying new operators, etc. I think it can 
easily be adopted by building the appropriate tooling around the DSL.
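A minimal sketch of what an internal DSL could feel like (all names hypothetical, Python standing in for the host language): operators are just methods on a pipeline object, so composition is ordinary method chaining and no parser is needed.

```python
# Illustrative internal DSL: a pipeline records transformations as
# composed functions and can be run over a finite input for testing.
class Pipeline:
    def __init__(self, run=lambda msgs: msgs):
        self._run = run

    def map(self, f):
        return Pipeline(lambda msgs: [f(x) for x in self._run(msgs)])

    def filter(self, p):
        return Pipeline(lambda msgs: [x for x in self._run(msgs) if p(x)])

    def run(self, msgs):
        # In a real system this would consume a topic; here we feed a list.
        return self._run(msgs)

# Usage reads like an embedded query:
loud = Pipeline().map(str.upper).filter(lambda s: s.startswith("A"))
```

Because the "program" is just an object built by the host language, trying a new operator means adding a method, which is what makes this style attractive for fast prototyping.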

##External DSL
In this case the DSL has its own syntax, which means it requires a 
lexer/parser. Luckily, there are plenty of tools for this: there is the 
YACC/JavaCC family of tools for JVM languages, and there is also a more 
interesting approach in Scala (which I have never tried, though) using parser 
combinators.
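To illustrate the parser-combinator idea without pulling in Scala's actual parser-combinator library, here is a hand-rolled toy version (Python for brevity) parsing an invented window clause. A parser maps input to a (value, remaining input) pair, or None on failure, and combinators glue small parsers into a grammar:

```python
# Toy hand-rolled parser combinators; grammar and names are invented.
def token(t):
    def parse(s):
        if s.startswith(t):
            return t, s[len(t):].lstrip(" ")
        return None
    return parse

def number(s):
    i = 0
    while i < len(s) and s[i].isdigit():
        i += 1
    if i == 0:
        return None
    return int(s[:i]), s[i:].lstrip(" ")

# Sequencing combinator: run p, then q on whatever input p left over.
def and_then(p, q):
    def parse(s):
        r = p(s)
        if r is None:
            return None
        a, rest = r
        r2 = q(rest)
        if r2 is None:
            return None
        b, rest2 = r2
        return (a, b), rest2
    return parse

# Grammar for a tuple-window clause: "window <n> tuples"
def window_clause(s):
    r = and_then(token("window"), and_then(number, token("tuples")))(s)
    if r is None:
        return None
    ((_, (n, _)), rest) = r
    return n, rest
```

The appeal is that the grammar lives in ordinary host-language code, so no separate lexer/parser generator step (as with YACC/JavaCC) is needed.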

So I guess a good first step would be to think hard about which of the two 
abstractions is preferable, i.e. stream-SQL or an imperative-style DSL. Note 
that regardless of the choice, once there is a vertical solution for one of 
them (from query to run), it will be possible to write one for the other.


> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
