[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Raul Castro Fernandez (JIRA) Sat, 27 Sep 2014 04:21:07 -0700

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150540#comment-14150540
 ]


Raul Castro Fernandez commented on SAMZA-390:
---------------------------------------------

Cool, so it seems this converges towards the SQL-like solution.

I think that this is a good opportunity to enrich the data model. It is 
important to understand why people like SQL and keep those features. However, 
there are other things that can change, and I think the data model is probably 
one thing you can change without annoying people too much.

This is what I was thinking:
One idea is to remove the window semantics from the query and instead push 
them into the data model. Instead of specifying the window in a query, you 
would actually be querying a special table/stream that represents the window 
(think of it as some sort of materialized view). For users, I believe this is 
natural: all data is accessed the same way. In particular, they do not need to 
worry about specifying windows in the query, only in the data model.
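
To make the idea concrete, here is a minimal Python sketch, not an actual Samza API. The names `WindowView`, `append`, `scan`, and `count_per_window` are hypothetical, and a tumbling (non-overlapping) window is assumed only for simplicity. The window is declared once in the data model, and the query just scans it like any other table:

```python
from collections import defaultdict


class WindowView:
    """Hypothetical tumbling-window view over a stream (illustration only)."""

    def __init__(self, size_ms):
        self.size_ms = size_ms
        self.rows = defaultdict(list)  # window start -> rows in that window

    def append(self, ts_ms, row):
        # The data model, not the query, decides which window a row belongs to.
        start = ts_ms - ts_ms % self.size_ms
        self.rows[start].append(row)

    def scan(self):
        # Yield (window_start, row) pairs, the way a table scan would.
        for start in sorted(self.rows):
            for row in self.rows[start]:
                yield start, row


def count_per_window(view):
    # Roughly: SELECT window_start, COUNT(*) FROM view GROUP BY window_start
    # Note that the query itself contains no window clause.
    counts = defaultdict(int)
    for start, _row in view.scan():
        counts[start] += 1
    return dict(counts)


# The window is defined once, as part of the data model.
clicks_1s = WindowView(size_ms=1000)
for ts, user in [(100, "a"), (900, "b"), (1200, "a"), (2500, "c"), (2600, "a")]:
    clicks_1s.append(ts, {"user": user})

result = count_per_window(clicks_1s)
```

The same `count_per_window` function would work unchanged over a view with a different window size, which is the point: the window is a property of the data, not of the query.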

From a system implementation, performance, and usability point of view, these 
are some advantages that come to mind:

- Easier to reason about windows. There is an opportunity to remove the 
constraints of the Oracle-vs-Streambase model explained above. When I say 
remove, I might mean hide them; I am not sure yet.
- Performance. If someone defines a huge window, all data in the window needs 
to live in the memory of the node processing that window. This can be wasteful 
or impossible, requiring partial data to be materialized to disk and all sorts 
of ugly things (I started discussing some of these problems with 
[~criccomini]). Instead, I think that by leaving this responsibility to the 
underlying storage, you can get rid of this problem by evaluating the window 
lazily, once it's full. This does seem to make processing windows 
incrementally more difficult, but I think it is better to have that problem 
than the problem of not knowing whether all the data will fit in the memory of 
my machine.
- Debugging. While debugging your query, you can ask the engine to save all 
the windows you are processing (remember, windows are now just another data 
structure, such as a table, etc). That way one should be able to just replay 
them, factoring out time, which in general is quite difficult.
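
The performance and debugging points above can be sketched together. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the underlying storage, the table `window_rows`, the functions `append`, `evaluate_closed`, and `replay`, and the notion of a `watermark_ts` marking which windows are full are all hypothetical, and nothing here reflects how Samza actually stores state.

```python
import sqlite3

WINDOW_MS = 1000  # hypothetical tumbling window size

# An in-memory database stands in for a disk-backed store (assumption).
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE window_rows (window_start INTEGER, ts INTEGER, value INTEGER)"
)


def append(ts, value):
    # Writing only touches storage; nothing is held or aggregated in memory.
    start = ts - ts % WINDOW_MS
    db.execute("INSERT INTO window_rows VALUES (?, ?, ?)", (start, ts, value))


def evaluate_closed(watermark_ts):
    # Lazy evaluation: aggregate only windows that ended at or before the
    # watermark, i.e. windows that are already full.
    cur = db.execute(
        "SELECT window_start, SUM(value) FROM window_rows "
        "WHERE window_start + ? <= ? "
        "GROUP BY window_start ORDER BY window_start",
        (WINDOW_MS, watermark_ts),
    )
    return cur.fetchall()


def replay(window_start):
    # Debugging aid: re-read a saved window's contents in timestamp order,
    # independent of wall-clock time.
    cur = db.execute(
        "SELECT ts, value FROM window_rows WHERE window_start = ? ORDER BY ts",
        (window_start,),
    )
    return cur.fetchall()


for ts, value in [(100, 5), (900, 7), (1200, 1), (2500, 4)]:
    append(ts, value)

closed = evaluate_closed(watermark_ts=2000)  # window starting at 2000 is still open
saved = replay(0)
```

The trade-off mentioned above is visible here: `evaluate_closed` recomputes each aggregate from stored rows instead of updating it incrementally as rows arrive.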

In general, all this would be about defining a robust and richer data model, 
i.e. not only tables, but also streams, windows, and probably some other stuff 
to enable incremental processing. What I mean is that users would still have a 
SQL-like interface. This probably has some limitations, but I just wanted to 
keep the ball rolling.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
