[ https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150540#comment-14150540 ]
Raul Castro Fernandez commented on SAMZA-390:
---------------------------------------------
Cool, so it seems this converges towards the SQL-like solution.
I think that this is a good opportunity to enrich the data model. It is
important to understand why people like SQL and keep those features; however,
there are other things that can change. I think the data model is probably one
thing you can change without annoying people too much.
This is what I was thinking:
One idea is to remove the window semantics from the query and instead push
them into the data model. Instead of specifying the window in a query, you would
actually be querying a special table/stream that represents the window (think
of it as some sort of materialized view). For users, I believe this is
natural: all data is accessed the same way. In particular, they do not need to
worry about specifying windows in the query, only in the data model.
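
A minimal sketch of what "the window lives in the data model" could look like,
in Python. Everything here is hypothetical and only for illustration (the
`Table` and `WindowView` classes, the `select` helper, the `ts` field and the
catalog names are not part of Samza): a window is registered next to the
tables, and the query itself contains no window clause.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, object]


@dataclass
class Table:
    """A plain table: scanning it returns every row."""
    rows: List[Row]

    def scan(self) -> Iterator[Row]:
        return iter(self.rows)


@dataclass
class WindowView:
    """A window declared in the data model (hypothetical).

    It exposes the same scan() interface as a table, but only returns the
    rows of `source` whose timestamp falls in [start, start + size).
    """
    source: List[Row]  # a list stands in for the underlying stream
    time_field: str
    start: int
    size: int

    def scan(self) -> Iterator[Row]:
        end = self.start + self.size
        return (r for r in self.source if self.start <= r[self.time_field] < end)


def select(relation, where: Callable[[Row], bool] = lambda r: True) -> List[Row]:
    """The 'query': it never mentions windows, it only names a relation."""
    return [r for r in relation.scan() if where(r)]


events = [
    {"ts": 1, "user": "a"},
    {"ts": 5, "user": "b"},
    {"ts": 12, "user": "a"},
]

# The window is part of the catalog (the data model), not of the query.
catalog = {
    "events": Table(events),
    "events_first_10": WindowView(events, time_field="ts", start=0, size=10),
}

all_a = select(catalog["events"], where=lambda r: r["user"] == "a")
windowed_a = select(catalog["events_first_10"], where=lambda r: r["user"] == "a")
```

The same `select` call works against both entries; only the catalog entry
decides whether the data is windowed.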
From a system implementation, performance and usability point of view, these
are some advantages that come to my mind:
- Easier to reason about windows. There is an opportunity to remove the
constraints of the Oracle-vs-Streambase model explained above. When I say
remove, I might mean hide them, not sure yet.
- Performance. If someone defines a huge window, all data in the window needs
to live in the memory of the node processing that window. This can be wasteful
or impossible, requiring partial data to be materialized to disk and all sorts
of ugly things (I started discussing some of these problems with [~criccomini]).
Instead, I think that by leaving this responsibility to the underlying storage,
you can get rid of this problem by evaluating the window lazily, once it's full.
This seems to make it more difficult to process windows incrementally. I think
it is better to have this problem than the problem of not knowing whether all
the data will fit in the memory of my machine.
- Debugging. While debugging your query, you can ask the engine to save all
the windows you are processing (remember, windows are now just another data
structure, such as a table, etc). That way one should be able to just replay
them, factoring out time, which in general is quite difficult.
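
The performance and debugging points can be sketched together, again in
Python and again under loud assumptions: `StoredWindows` is a hypothetical
class, a dict of lists stands in for the underlying (disk-backed) storage,
windows are tumbling windows keyed by `ts // size`, and a caller-supplied
`watermark` is assumed as the way to know that a window is full. None of this
is an existing Samza API.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, int]


class StoredWindows:
    """Rows are appended to storage; a window is evaluated only once full."""

    def __init__(self, size: int) -> None:
        self.size = size
        # Stand-in for the underlying storage: window id -> rows.
        self.storage: Dict[int, List[Row]] = {}

    def append(self, row: Row) -> None:
        # No query work happens here; the row is just stored.
        window_id = row["ts"] // self.size
        self.storage.setdefault(window_id, []).append(row)

    def full_windows(self, watermark: int) -> List[int]:
        # Window w covers [w * size, (w + 1) * size); it is full once the
        # watermark has reached its end.
        return sorted(w for w in self.storage if (w + 1) * self.size <= watermark)

    def evaluate(self, window_id: int, query: Callable[[List[Row]], object]) -> object:
        # Lazy evaluation: the query runs over the stored rows on demand.
        return query(self.storage[window_id])

    def replay(self, query: Callable[[List[Row]], object]) -> List[Tuple[int, object]]:
        # Debugging: re-run the query over every saved window, in window
        # order, without depending on the wall clock.
        return [(w, query(rows)) for w, rows in sorted(self.storage.items())]


store = StoredWindows(size=10)
for ts in (1, 4, 11, 25):
    store.append({"ts": ts})

ready = store.full_windows(watermark=20)
first_count = store.evaluate(0, len)
replayed = store.replay(len)
```

The trade-off mentioned above is visible here: nothing is computed
incrementally in `append`, so the whole window is read back when it is
evaluated.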
In general, all this would be about defining a robust and richer data model,
i.e. not only tables, but also streams, windows, and probably some other stuff
to enable incremental processing. What I mean is that users would still have a
SQL-like interface. This probably has some limitations, but I just wanted to
keep the ball rolling.
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.